# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as covered in the lesson on importing/exporting data.


In [2]:
import pymysql
from sqlalchemy import create_engine


#### 3. Create a MySQL engine to set up the connection to the server.
Check the connection details [here](https://relational.fit.cvut.cz/dataset/Stats).

In [3]:
# URL format: 'mysql+pymysql://[USER]:[PASSWORD]@[HOST]/[DB NAME]'
engine = create_engine('mysql+pymysql://guest:relational@relational.fit.cvut.cz/stats')

#### 4. Import the users table.

In [4]:
users = pd.read_sql_query('SELECT * FROM users', engine)
users.head()

Unnamed: 0,Id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,
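
To experiment with the `pd.read_sql_query` pattern without network access, the same call also accepts a standard-library `sqlite3` connection. The following is only a minimal sketch: the in-memory table `demo_users` and its two rows are hypothetical stand-ins, not the real Stats data.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the remote server: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_users (Id INTEGER, Reputation INTEGER)")
conn.executemany("INSERT INTO demo_users VALUES (?, ?)", [(1, 101), (2, 6792)])
conn.commit()

# Same call shape as with the MySQL engine; only the connection object differs.
demo_users = pd.read_sql_query("SELECT * FROM demo_users", conn)
conn.close()

print(demo_users.shape)
```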


#### 5. Rename Id column to userId.

In [5]:
users = users.rename(columns = {'Id':'userId'})
users.head()

Unnamed: 0,userId,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,


#### 6. Import the posts table. 

In [None]:
posts = pd.read_sql_query('SELECT * FROM posts', engine)
posts.head()

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [None]:
posts = posts.rename(columns = {'Id':'postId', 'OwnerUserId':'userId'})
posts.head()
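
By default, `rename` silently ignores keys that do not match any column, so a typo such as `OwnerUserID` would go unnoticed. A small sketch of how `errors="raise"` can surface that, using a hypothetical `demo_posts` frame with made-up values:

```python
import pandas as pd

# Hypothetical miniature of the posts table.
demo_posts = pd.DataFrame({"Id": [10, 11], "OwnerUserId": [1, 2], "Score": [3, 0]})

# Correct keys: both columns are renamed.
renamed = demo_posts.rename(
    columns={"Id": "postId", "OwnerUserId": "userId"}, errors="raise"
)

# A misspelled key is ignored by default, leaving the columns unchanged...
unchanged = demo_posts.rename(columns={"OwnerUserID": "userId"})

# ...but raises KeyError when errors="raise" is passed.
try:
    demo_posts.rename(columns={"OwnerUserID": "userId"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns), typo_detected)
```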

#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [None]:
new_users = users[["userId","Reputation","Views","UpVotes","DownVotes"]]
new_posts = posts[["postId","Score","userId","ViewCount","CommentCount"]]
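
If the selected frame will be modified later (for example when filling missing values), adding `.copy()` makes it explicit that the subset is independent of the source frame. A minimal sketch on a hypothetical `demo` frame:

```python
import pandas as pd

# Hypothetical miniature of the users table.
demo = pd.DataFrame(
    {"userId": [1, 2], "Reputation": [101, 6792], "Age": [37.0, None]}
)

# Select the columns of interest and take an explicit copy.
subset = demo[["userId", "Reputation"]].copy()

# Changing the subset leaves the original frame untouched.
subset["Reputation"] = subset["Reputation"] * 2

print(demo["Reputation"].tolist(), subset["Reputation"].tolist())
```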

#### 9. Merge the new users and posts dataframes you have created.
You will need to perform an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

In [None]:
merge_df = pd.merge(new_users, new_posts, on = ['userId'], how = 'inner')
merge_df
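
An inner merge keeps only the `userId` values present in both frames, so unmatched users and posts disappear without notice. The sketch below, on hypothetical `demo_users` and `demo_posts` frames, uses two existing `pd.merge` options as a sanity check: `validate` asserts the expected one-user-to-many-posts relationship, and `indicator=True` on an outer merge shows which rows the inner merge would drop.

```python
import pandas as pd

# Hypothetical miniatures: user 3 has no posts, post 13 has an unknown owner.
demo_users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [101, 6792, 1]})
demo_posts = pd.DataFrame(
    {"postId": [10, 11, 12, 13], "userId": [1, 1, 2, 99], "Score": [3, 0, 5, 1]}
)

# Inner merge; validate raises MergeError if userId is duplicated in demo_users.
inner = pd.merge(
    demo_users, demo_posts, on="userId", how="inner", validate="one_to_many"
)

# Outer merge with an indicator column to inspect the unmatched rows.
outer = pd.merge(demo_users, demo_posts, on="userId", how="outer", indicator=True)
dropped = outer[outer["_merge"] != "both"]

print(len(inner), "rows kept;", len(dropped), "rows would be dropped")
```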

#### 10. How many missing values do you have in your merged dataframe? On which columns?

In [None]:
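# Non-null count per column, for reference
print(merge_df.count(), "\n")

# Count the missing values column by column
missing_value = []
total_rows = len(merge_df)

for col in merge_df:
    missing_value.append(merge_df[col].isnull().sum())
    print(col, "column has", missing_value[-1], "missing value(s) out of", total_rows, "records.\n")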
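
The same per-column count can be obtained without an explicit loop, since `isna().sum()` works on the whole frame at once. A minimal sketch on a hypothetical `demo` frame with made-up values:

```python
import pandas as pd

# Hypothetical miniature of the merged frame; None becomes NaN in ViewCount.
demo = pd.DataFrame(
    {"postId": [10, 11, 12], "ViewCount": [1278.0, None, None], "Score": [3, 0, 5]}
)

# Missing values per column, as a Series indexed by column name.
missing = demo.isna().sum()

# Keep only the columns that actually have gaps.
missing_only = missing[missing > 0]

# Fraction of missing values per column.
share = demo.isna().mean()

print(missing_only)
```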

#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [None]:
# The ViewCount field has missing values.
# I can't see a relationship between ViewCount and any other field.
# For now I fill the gaps with 0; after discussing this with the relevant party, I would drop it later if it is not required.


merge_df['ViewCount'] = merge_df['ViewCount'].fillna(0)
merge_df
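
Filling with 0 and dropping rows lead to different downstream statistics, which is worth checking before settling on either. The sketch below compares the two on a hypothetical `demo` frame; the extra `ViewCountMissing` flag column is an illustrative assumption, not part of the exercise.

```python
import pandas as pd

# Hypothetical miniature: two of four posts have no ViewCount.
demo = pd.DataFrame(
    {"postId": [10, 11, 12, 13], "ViewCount": [100.0, None, 300.0, None]}
)

# Option A: fill with 0 (keeps every row, but pulls the mean down).
filled = demo.assign(ViewCount=demo["ViewCount"].fillna(0))

# Option B: drop the rows where ViewCount is missing.
dropped = demo.dropna(subset=["ViewCount"])

# Optional: remember which values were originally missing.
flagged = demo.assign(ViewCountMissing=demo["ViewCount"].isna())

print(filled["ViewCount"].mean(), dropped["ViewCount"].mean())
```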

#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [None]:
# ViewCount holds count data, so I would change it to an integer data type.
merge_df['ViewCount'] = merge_df['ViewCount'].astype("int64")
merge_df
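
The cast to `int64` works here only because the missing values were filled in step 11; a float column that still contains NaN cannot be converted to a plain NumPy integer type. As a hedged alternative, pandas' nullable `Int64` dtype keeps the gaps as `<NA>`. A minimal sketch on a hypothetical `demo` frame:

```python
import pandas as pd

# Hypothetical column with one missing value.
demo = pd.DataFrame({"ViewCount": [100.0, None, 300.0]})

# Casting NaN to a NumPy integer dtype raises an error.
try:
    demo["ViewCount"].astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable integer dtype (capital "I") preserves the missing value.
nullable = demo["ViewCount"].astype("Int64")

# Filling first, as done in the exercise, allows the plain int64 cast.
filled_int = demo["ViewCount"].fillna(0).astype("int64")

print(cast_failed, nullable.dtype, filled_int.dtype)
```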