# Data Cleaning 

#### 1. Import pandas library.

In [None]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as covered in the lesson on importing and exporting data.


In [None]:
import pymysql
from sqlalchemy import create_engine

#### 3. Create a mysql engine to set the connection to the server. 
Check the connection details [here](https://relational.fit.cvut.cz/dataset/Stats).

In [None]:
engine = create_engine('mysql+pymysql://guest:relational@relational.fit.cvut.cz:3306')

#### 4. Import the users table.

In [None]:
users = pd.read_sql_query('SELECT * FROM stats.users', engine)
users.head()

#### 5. Rename Id column to userId.

In [None]:
users = users.rename(columns = {'Id': 'userId'})
users.head()

#### 6. Import the posts table. 

In [None]:
posts = pd.read_sql_query('SELECT * FROM stats.posts', engine)
posts.head()

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [None]:
posts = posts.rename(columns = {'Id': 'postId', 'OwnerUserId': 'userId'})
posts.head()

#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [None]:
user_columns = users[['userId', 'Reputation', 'Views', 'UpVotes', 'DownVotes']]

posts_columns = posts[['postId', 'Score', 'userId', 'ViewCount', 'CommentCount']]
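
As an aside, the renaming and column selection from steps 5 to 8 could also be done in the SQL query itself. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the remote MySQL server, and the table contents and the `toy_users` name are made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the remote server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (Id INTEGER, Reputation INTEGER, Location TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, 10, "A"), (2, 20, "B")],  # made-up rows
)
conn.commit()

# Select only the needed columns and rename Id in the query itself.
toy_users = pd.read_sql_query("SELECT Id AS userId, Reputation FROM users", conn)
conn.close()

print(toy_users)
```

Selecting in SQL transfers less data, while selecting in pandas keeps the full table available for later exploration.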


#### 9. Merge the new dataframes you have created, of users and posts. 
You will need to do an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

In [None]:
merged = pd.merge(user_columns, posts_columns, on = 'userId', how = 'inner')
merged.head()
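
To see what the inner merge keeps and drops, here is a minimal sketch on made-up data. The `toy_users` and `toy_posts` frames are hypothetical stand-ins, not the real tables. The outer merge with `indicator=True` is only there to show which rows the inner merge leaves out.

```python
import pandas as pd

# Toy stand-ins for the users and posts frames (values are made up).
toy_users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [10, 20, 30]})
toy_posts = pd.DataFrame(
    {"postId": [100, 101, 102], "userId": [1, 1, 4], "Score": [5, 2, 7]}
)

# An inner merge keeps only the userId values present in both frames.
toy_inner = pd.merge(toy_users, toy_posts, on="userId", how="inner")

# An outer merge with indicator=True labels where each row came from.
toy_outer = pd.merge(
    toy_users, toy_posts, on="userId", how="outer", indicator=True
)

print(toy_inner)
print(toy_outer["_merge"].value_counts())
```

Users without posts and posts without a matching user both disappear from the inner result, so it is worth comparing `merged.shape` with the shapes of the two inputs.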

#### 10. How many missing values do you have in your merged dataframe? On which columns?

In [None]:
null_cols = merged.isnull().sum()
null_cols[null_cols > 0]

# 48396 missing values in the column 'ViewCount'
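
A count is easier to judge next to the share of rows it represents. The sketch below, on a made-up frame named `toy`, puts both in one small summary table. The same pattern should work on `merged`.

```python
import numpy as np
import pandas as pd

# Made-up frame with one complete column and one with gaps.
toy = pd.DataFrame(
    {"Score": [1, 2, 3, 4], "ViewCount": [10.0, np.nan, np.nan, 7.0]}
)

# isnull().sum() counts missing values; isnull().mean() gives their share.
summary = pd.DataFrame(
    {"missing": toy.isnull().sum(), "share": toy.isnull().mean()}
)

# Keep only the columns that actually have missing values.
summary = summary[summary["missing"] > 0]
print(summary)
```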

#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [None]:
merged.shape
# The NaNs make up more than half of the values in the ViewCount column and,
# after a bit of searching, they don't seem to be recoverable from other columns
# (ViewCount does not correspond to Views). So I'm removing the column.

merged_clean = merged.drop(['ViewCount'], axis=1)
merged_clean.head()
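
For comparison, the three usual options are sketched below on a made-up frame: dropping the column (the choice made above), dropping the affected rows, or filling the gaps. Filling with the median is just one possible strategy chosen for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame; half of ViewCount is missing.
toy = pd.DataFrame(
    {"postId": [1, 2, 3, 4], "ViewCount": [10.0, np.nan, np.nan, 30.0]}
)

# Option 1: drop the whole column (keeps every row, loses the variable).
dropped_column = toy.drop(columns=["ViewCount"])

# Option 2: drop only the rows where ViewCount is missing.
dropped_rows = toy.dropna(subset=["ViewCount"])

# Option 3: fill the gaps, here with the median of the observed values.
filled = toy.fillna({"ViewCount": toy["ViewCount"].median()})

print(dropped_column.shape, dropped_rows.shape, filled.shape)
```

With more than half of the values missing, option 2 would discard most of the merged rows and option 3 would invent most of the column, which supports dropping it.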

#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [None]:
merged_clean.dtypes

In [None]:
# All the remaining columns are already integer type, so I don't see a reason to change any of them.
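
The dtype check can also be done in code rather than by eye. The sketch below uses a made-up frame and also shows what might have been done if `ViewCount` had been kept. A numeric column containing NaN is stored as `float64`, and pandas' nullable `Int64` dtype is one way to keep it as integers. This is an illustration, not a step the exercise requires.

```python
import numpy as np
import pandas as pd

# Made-up frame: ViewCount holds whole numbers plus one missing value.
toy = pd.DataFrame({"Score": [1, 2, 3], "ViewCount": [10.0, np.nan, 30.0]})

# The NaN forces ViewCount to float64.
print(toy.dtypes)

# The nullable Int64 dtype stores integers and missing values together.
toy["ViewCount"] = toy["ViewCount"].astype("Int64")

# List any columns that are still not integer-typed.
non_integer = [
    col for col in toy.columns if not pd.api.types.is_integer_dtype(toy[col])
]
print(non_integer)
```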