# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy as you learned in the lesson on importing/exporting data.


In [2]:
import pymysql
from sqlalchemy import create_engine


#### 3. Create a mysql engine to set the connection to the server. 
Check the connection details [here](https://relational.fit.cvut.cz/dataset/Stats).

In [3]:
# Use following credentials:
# hostname: relational.fit.cvut.cz
# port: 3306
# username: guest
# password: relational

In [4]:
engine = create_engine("mysql+pymysql://guest:relational@relational.fit.cvut.cz:3306/stats")

#### 4. Import the users table.

In [5]:
users = pd.read_sql_table('users',engine)

#### 5. Rename Id column to userId.

In [6]:
users.head()

Unnamed: 0,Id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,


In [7]:
users = users.rename(columns={'Id':'userId'})
users.head()

#[Tjerk:] neat!

Unnamed: 0,userId,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,
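
The rename step can be sketched on a toy frame (the `toy_users` data below is made up for illustration and is not the real table). Passing `errors='raise'` is an optional safeguard, not part of the exercise: it makes pandas fail loudly when a column named in the mapping does not exist, which catches typos such as `'ID'` instead of `'Id'`.

```python
import pandas as pd

# Toy stand-in for the users table (values are illustrative only)
toy_users = pd.DataFrame({'Id': [2, 3], 'Reputation': [101, 101]})

# errors='raise' turns a misspelled source column into a KeyError
renamed = toy_users.rename(columns={'Id': 'userId'}, errors='raise')
print(list(renamed.columns))

# A typo in the source column name is now caught instead of silently ignored
try:
    toy_users.rename(columns={'ID': 'userId'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```

Without `errors='raise'`, a misspelled key is ignored and the frame comes back unchanged, which is easy to miss.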


#### 6. Import the posts table. 

In [8]:
posts = pd.read_sql_table('posts',engine)

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [9]:
posts.head(1)
posts = posts.rename(columns={'Id':'postId','OwnerUserId':'userId'})
#[Tjerk:] neat!

#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [10]:
users_small = users[['userId', 'Reputation', 'Views', 'UpVotes', 'DownVotes']]
posts_small = posts[['postId', 'Score', 'userId', 'ViewCount', 'CommentCount']]
#[Tjerk:] neat!

#### 9. Merge the new dataframes you have created, of users and posts. 
You will need to perform an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

In [11]:
combined = users_small.merge(posts_small,how='inner',on='userId')
#[Tjerk:] neat, and good that you specify how you want to join (inner)
# put a space after each comma, as you did above; this may come across as nitpicking,
# but being consistent will matter a lot to your fellow coders later on. It also earns you the kind of respect you can draw on later
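
A minimal sketch of how the inner merge could be sanity-checked, using made-up toy frames rather than the real tables. The `validate` and `indicator` arguments are optional extras from `DataFrame.merge`, not something the exercise asks for: `validate='one_to_many'` checks that `userId` is unique on the users side, and an outer merge with `indicator=True` shows which rows an inner merge would drop.

```python
import pandas as pd

# Toy data: user 2 and 3 have no posts, post 102 has an unknown user (4)
toy_users = pd.DataFrame({'userId': [1, 2, 3], 'Reputation': [10, 20, 30]})
toy_posts = pd.DataFrame({'postId': [100, 101, 102],
                          'userId': [1, 1, 4],
                          'Score': [5, 0, 2]})

# Raises pandas.errors.MergeError if userId is duplicated in toy_users
inner = toy_users.merge(toy_posts, how='inner', on='userId',
                        validate='one_to_many')

# The outer merge keeps everything; _merge tells where each row came from
audit = toy_users.merge(toy_posts, how='outer', on='userId', indicator=True)
dropped = audit[audit['_merge'] != 'both']

print(len(inner), len(dropped))
```

Here the inner merge keeps only the two posts of user 1; the three rows in `dropped` are the ones it leaves out.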


#### 10. How many missing values do you have in your merged dataframe? On which columns?

In [12]:
print(combined.isnull().sum())

# it seems that my column viewCount contains null values

#[Tjerk:]
# you are becoming a data analyst; "it seems" is vague, are you not sure?
# next time keep it short, concise and clear:
# column viewcount has null values. PERIOD.

userId              0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
postId              0
Score               0
ViewCount       48396
CommentCount        0
dtype: int64
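
The missing-value count above can be narrowed to only the affected columns. This is a sketch on a made-up toy frame; the names `toy`, `report` and `share` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing ViewCount values (illustrative data only)
toy = pd.DataFrame({'Score': [1, 2, 3, 4],
                    'ViewCount': [10.0, np.nan, np.nan, 7.0]})

missing = toy.isna().sum()        # count of missing values per column
report = missing[missing > 0]     # keep only columns that have any
share = toy.isna().mean()         # fraction of missing values per column

print(report)
print(share['ViewCount'])
```

Reporting the share next to the count helps judge how much of the data a fill or drop decision will affect.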


#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [13]:
import numpy as np

## I will fill them because I may want to see posts by users that never had any views.
## I am always a bit careful about removing original info from my dataframe
combined['ViewCount'] = np.where(combined['ViewCount'].isna(), 0, combined['ViewCount'])
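
The two options from the question, filling versus dropping, can be compared side by side on a made-up toy frame. `fillna` is used here as a common pandas alternative to the `np.where` call. Whether a missing `ViewCount` really means zero views is an assumption that should be checked against the dataset description.

```python
import numpy as np
import pandas as pd

# Toy frame: post 2 has no recorded view count (illustrative data only)
toy = pd.DataFrame({'postId': [1, 2, 3],
                    'ViewCount': [10.0, np.nan, 7.0]})

# Option A: keep every row and treat missing as 0 (an assumption!)
filled = toy.assign(ViewCount=toy['ViewCount'].fillna(0))

# Option B: drop the rows where ViewCount is missing
dropped = toy.dropna(subset=['ViewCount'])

# Check the result before moving on, as the exercise asks
print(filled['ViewCount'].isna().sum(), len(filled), len(dropped))
```

Filling keeps all rows but pulls averages toward zero; dropping keeps the remaining values untouched but loses rows.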



#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [None]:
print(combined.dtypes)

print("""\nI would change ViewCount from float to integer 
as ViewCount will never have half views 
""")

#[Tjerk]: you're getting warm! you use triple quotes?
# handy! the advantage is that you can write a string with line breaks in it without \n
# and yet you still use \n, which shows you don't fully understand them
#
# tip:
# either use single quotes and \n
# or use triple quotes and type the line break itself
# example:

In [2]:
print('''this

is

an easy way
to make
blank
lines''')

this

is

an easy way
to make
blank
lines


In [14]:
combined['ViewCount'] = combined['ViewCount'].astype(int)

print(combined.dtypes)

#[Tjerk]: well done! be as specific as possible when stating the type, so use int64
# do you know what the 64 stands for? look it up and let me know if you don't understand it. become a pro!

# there was also a bonus exercise here, which you removed. will you still do it if you have time? that's how you get the most out of your bootcamp!

userId            int64
Reputation        int64
Views             int64
UpVotes           int64
DownVotes         int64
postId            int64
Score             int64
ViewCount       float64
CommentCount      int64
dtype: object

I would change ViewCount from float to integer 
as ViewCount will never have half views 

userId          int64
Reputation      int64
Views           int64
UpVotes         int64
DownVotes       int64
postId          int64
Score           int64
ViewCount       int64
CommentCount    int64
dtype: object
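
A short sketch of the explicit `int64` conversion the reviewer suggests, on a made-up toy series. It also shows pandas' nullable `Int64` dtype (capital I) as an alternative that keeps missing values missing. The nullable variant is not part of the exercise and is included only for comparison.

```python
import numpy as np
import pandas as pd

# Toy ViewCount column with one missing value (illustrative data only)
views = pd.Series([10.0, np.nan, 7.0], name='ViewCount')

# Explicit 64-bit integer: requires the missing values to be filled first
as_int64 = views.fillna(0).astype('int64')

# Nullable integer dtype: integers plus <NA>, no fill needed
nullable = views.astype('Int64')

print(as_int64.dtype, nullable.dtype, nullable.isna().sum())
```

Spelling out `'int64'` avoids depending on what plain `int` maps to, which can differ by platform and NumPy version.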
