# Data Cleaning 

#### 1. Import the pandas library.

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy as you learned in the lesson on importing/exporting data.


In [3]:
import pymysql
from sqlalchemy import create_engine

#### 3. Create a MySQL engine to set up the connection to the server.
Check the connection details [here](https://relational.fit.cvut.cz/dataset/Stats).

In [4]:
engine = create_engine('mysql+pymysql://guest:relational@relational.fit.cvut.cz/stats')

#### 4. Import the users table.

In [5]:
users = pd.read_sql_query('SELECT * FROM users', engine)
print(users.head())

   Id  Reputation        CreationDate   DisplayName      LastAccessDate  \
0  -1           1 2010-07-19 06:55:26     Community 2010-07-19 06:55:26   
1   2         101 2010-07-19 14:01:36  Geoff Dalgas 2013-11-12 22:07:23   
2   3         101 2010-07-19 15:34:50  Jarrod Dixon 2014-08-08 06:42:58   
3   4         101 2010-07-19 19:03:27        Emmett 2014-01-02 09:31:02   
4   5        6792 2010-07-19 19:03:57         Shane 2014-08-13 00:23:47   

                       WebsiteUrl            Location  \
0  http://meta.stackexchange.com/  on the server farm   
1        http://stackoverflow.com       Corvallis, OR   
2        http://stackoverflow.com        New York, NY   
3    http://minesweeperonline.com   San Francisco, CA   
4         http://www.statalgo.com        New York, NY   

                                             AboutMe  Views  UpVotes  \
0  <p>Hi, I'm not really a person.</p>\n\n<p>I'm ...      0     5007   
1  <p>Developer on the StackOverflow team.  Find ...     25   
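
The same `pd.read_sql_query` call can be tried offline. The sketch below is not the lab's MySQL setup: it assumes an in-memory SQLite database as a local stand-in, with a hypothetical three-row `users` table whose values are copied from the preview above. It also shows selecting only the needed columns in SQL instead of `SELECT *`, which may be worth considering for wide tables.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database: a local stand-in for the remote MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (Id INTEGER, Reputation INTEGER, DisplayName TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(-1, 1, "Community"), (2, 101, "Geoff Dalgas"), (5, 6792, "Shane")],
)
conn.commit()

# Ask the database for just the columns that will be used later.
users_small = pd.read_sql_query("SELECT Id, Reputation FROM users", conn)
print(users_small)

conn.close()
```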

#### 5. Rename Id column to userId.

In [12]:
users_new = users.rename(columns = {'Id' : 'userId'})
print(users_new.head())

   userId  Reputation        CreationDate   DisplayName      LastAccessDate  \
0      -1           1 2010-07-19 06:55:26     Community 2010-07-19 06:55:26   
1       2         101 2010-07-19 14:01:36  Geoff Dalgas 2013-11-12 22:07:23   
2       3         101 2010-07-19 15:34:50  Jarrod Dixon 2014-08-08 06:42:58   
3       4         101 2010-07-19 19:03:27        Emmett 2014-01-02 09:31:02   
4       5        6792 2010-07-19 19:03:57         Shane 2014-08-13 00:23:47   

                       WebsiteUrl            Location  \
0  http://meta.stackexchange.com/  on the server farm   
1        http://stackoverflow.com       Corvallis, OR   
2        http://stackoverflow.com        New York, NY   
3    http://minesweeperonline.com   San Francisco, CA   
4         http://www.statalgo.com        New York, NY   

                                             AboutMe  Views  UpVotes  \
0  <p>Hi, I'm not really a person.</p>\n\n<p>I'm ...      0     5007   
1  <p>Developer on the StackOverflow t
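
A small sketch of how `rename` behaves, using a hypothetical two-row frame named `demo`. It illustrates two points: `rename` returns a new dataframe and leaves the original untouched, and passing `errors="raise"` turns a misspelled column name into a `KeyError` instead of a silent no-op.

```python
import pandas as pd

# Hypothetical toy frame; values taken from the users preview above.
demo = pd.DataFrame({"Id": [2, 5], "Reputation": [101, 6792]})

# rename returns a new DataFrame; demo keeps its original column names.
renamed = demo.rename(columns={"Id": "userId"})
print(list(demo.columns))
print(list(renamed.columns))

# With errors="raise", a key that matches no column raises KeyError.
raised = False
try:
    demo.rename(columns={"ID": "userId"}, errors="raise")
except KeyError:
    raised = True
print(raised)
```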

#### 6. Import the posts table. 

In [8]:
posts = pd.read_sql_query('SELECT * FROM posts', engine)
print(posts.head())

   Id  PostTypeId  AcceptedAnswerId         CreaionDate  Score  ViewCount  \
0   1           1              15.0 2010-07-19 19:12:12     23     1278.0   
1   2           1              59.0 2010-07-19 19:12:57     22     8198.0   
2   3           1               5.0 2010-07-19 19:13:28     54     3613.0   
3   4           1             135.0 2010-07-19 19:13:31     13     5224.0   
4   5           2               NaN 2010-07-19 19:14:43     81        NaN   

                                                Body  OwnerUserId  \
0  <p>How should I elicit prior distributions fro...          8.0   
1  <p>In many different statistical methods there...         24.0   
2  <p>What are some valuable Statistical Analysis...         18.0   
3  <p>I have two groups of data.  Each with a dif...         23.0   
4  <p>The R-project</p>\n\n<p><a href="http://www...         23.0   

      LasActivityDate                                              Title  ...  \
0 2010-09-15 21:08:26                    

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [10]:
posts_new = posts.rename(columns = {'Id' : 'postId', 'OwnerUserId': 'userId'})
print(posts_new.head())

   postId  PostTypeId  AcceptedAnswerId         CreaionDate  Score  ViewCount  \
0       1           1              15.0 2010-07-19 19:12:12     23     1278.0   
1       2           1              59.0 2010-07-19 19:12:57     22     8198.0   
2       3           1               5.0 2010-07-19 19:13:28     54     3613.0   
3       4           1             135.0 2010-07-19 19:13:31     13     5224.0   
4       5           2               NaN 2010-07-19 19:14:43     81        NaN   

                                                Body  userId  \
0  <p>How should I elicit prior distributions fro...     8.0   
1  <p>In many different statistical methods there...    24.0   
2  <p>What are some valuable Statistical Analysis...    18.0   
3  <p>I have two groups of data.  Each with a dif...    23.0   
4  <p>The R-project</p>\n\n<p><a href="http://www...    23.0   

      LasActivityDate                                              Title  ...  \
0 2010-09-15 21:08:26                      Elic

#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [15]:
users_columns = users_new[['userId', 'Reputation', 'Views', 'UpVotes', 'DownVotes']]
posts_columns = posts_new[['postId', 'Score', 'userId', 'ViewCount', 'CommentCount']]

print(users_columns.head())
print('\n')
print(posts_columns.head())

   userId  Reputation  Views  UpVotes  DownVotes
0      -1           1      0     5007       1920
1       2         101     25        3          0
2       3         101     22       19          0
3       4         101     11        0          0
4       5        6792   1145      662          5


   postId  Score  userId  ViewCount  CommentCount
0       1     23     8.0     1278.0             1
1       2     22    24.0     8198.0             1
2       3     54    18.0     3613.0             4
3       4     13    23.0     5224.0             2
4       5     81    23.0        NaN             3
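
One optional refinement when selecting columns is to call `.copy()` on the result, so that later assignments clearly act on an independent dataframe. The sketch below uses a hypothetical toy frame `demo`; it is an illustration of the idea, not a required change to the cell above.

```python
import pandas as pd

# Hypothetical toy frame with one column that will be left out.
demo = pd.DataFrame(
    {"userId": [2, 5], "Reputation": [101, 6792], "AboutMe": ["a", "b"]}
)

# Select the columns of interest and take an explicit copy.
subset = demo[["userId", "Reputation"]].copy()

# Modifying the copy does not touch the original frame.
subset["Reputation"] = subset["Reputation"] * 2
print(demo["Reputation"].tolist())
print(subset["Reputation"].tolist())
```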


#### 9. Merge the new users and posts dataframes you have created.
You will need to perform an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

In [16]:
merge = users_columns.merge(posts_columns, on='userId', how='inner')
print(merge.head())

   userId  Reputation  Views  UpVotes  DownVotes  postId  Score  ViewCount  \
0      -1           1      0     5007       1920    2175      0        NaN   
1      -1           1      0     5007       1920    8576      0        NaN   
2      -1           1      0     5007       1920    8578      0        NaN   
3      -1           1      0     5007       1920    8981      0        NaN   
4      -1           1      0     5007       1920    8982      0        NaN   

   CommentCount  
0             0  
1             0  
2             0  
3             0  
4             0  
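
To see what an inner merge keeps and discards, here is a minimal sketch on hypothetical toy frames (`users_demo`, `posts_demo`). An inner merge keeps only keys present on both sides; an outer merge with `indicator=True` is one way to audit which rows would have been dropped.

```python
import pandas as pd

# Hypothetical toy data: users 2 and 3 have no posts, and the post
# owned by user 99 has no matching user.
users_demo = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [10, 20, 30]})
posts_demo = pd.DataFrame(
    {"postId": [100, 101, 102], "userId": [1, 1, 99], "Score": [5, 3, 1]}
)

# Inner merge: only userId values found in both frames survive.
inner = users_demo.merge(posts_demo, on="userId", how="inner")
print(inner)

# Outer merge with an indicator column shows where each row came from.
audit = users_demo.merge(posts_demo, on="userId", how="outer", indicator=True)
print(audit["_merge"].value_counts())
```

Here the inner result has two rows (both posts by user 1), while the audit frame flags two `left_only` users and one `right_only` post.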


#### 10. How many missing values do you have in your merged dataframe? On which columns?

In [26]:
print(merge.count())
null_cols = merge.isnull().sum()
print('\n')
print("Number of missing values:")
print(null_cols[null_cols > 0])

# 48396 missing values, all in the ViewCount column

userId          90584
Reputation      90584
Views           90584
UpVotes         90584
DownVotes       90584
postId          90584
Score           90584
ViewCount       42188
CommentCount    90584
dtype: int64


Number of missing values:
ViewCount    48396
dtype: int64
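
In the output above, 48396 of 90584 rows lack a `ViewCount`, which is roughly 53%. A count and a share per column can be put side by side; the sketch below does this on a hypothetical four-row frame `demo`, with `summary` as an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing ViewCount values.
demo = pd.DataFrame(
    {
        "postId": [1, 2, 3, 4],
        "ViewCount": [1278.0, np.nan, 3613.0, np.nan],
        "CommentCount": [1, 1, 4, 2],
    }
)

# isnull().sum() counts missing values; isnull().mean() gives their share.
missing = demo.isnull().sum()
summary = pd.DataFrame({"missing": missing, "share": demo.isnull().mean()})

# Show only the columns that actually have missing values.
print(summary[summary["missing"] > 0])
```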


#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [54]:
# Check whether a single userId has several distinct ViewCount values
print(merge.groupby('userId')['ViewCount'].nunique())

# Inspect all rows for userId 5
pd.set_option('display.max_rows', None)
print(merge[merge['userId'] == 5])
pd.reset_option('display.max_rows')

# ViewCount varies per post and shows no relation to userId, so NaN is replaced with 0 in the next cell

userId
-1         0
 5        17
 6         6
 7         1
 8        12
          ..
 55734     1
 55738     1
 55742     1
 55744     1
 55746     1
Name: ViewCount, Length: 21983, dtype: int64
     userId  Reputation  Views  UpVotes  DownVotes  postId  Score  ViewCount  \
211       5        6792   1145      662          5       6    152    29229.0   
212       5        6792   1145      662          5      12     20        NaN   
213       5        6792   1145      662          5      32     12        NaN   
214       5        6792   1145      662          5      49      6        NaN   
215       5        6792   1145      662          5      64      6        NaN   
216       5        6792   1145      662          5      76     22        NaN   
217       5        6792   1145      662          5      83      2        NaN   
218       5        6792   1145      662          5      96      4        NaN   
219       5        6792   1145      662          5     103     28     1990.0   
220  

In [55]:
merge["ViewCount"] = merge["ViewCount"].fillna(0)
print(merge)

       userId  Reputation  Views  UpVotes  DownVotes  postId  Score  \
0          -1           1      0     5007       1920    2175      0   
1          -1           1      0     5007       1920    8576      0   
2          -1           1      0     5007       1920    8578      0   
3          -1           1      0     5007       1920    8981      0   
4          -1           1      0     5007       1920    8982      0   
...       ...         ...    ...      ...        ...     ...    ...   
90579   55734           1      0        0          0  115352      0   
90580   55738          11      0        0          0  115360      2   
90581   55742           6      0        0          0  115366      1   
90582   55744           6      1        0          0  115370      1   
90583   55746         106      1        0          0  115376      1   

       ViewCount  CommentCount  
0            0.0             0  
1            0.0             0  
2            0.0             0  
3            0.
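
Filling with zero and dropping rows lead to different statistics, so the choice is worth checking on a small case. The sketch below uses a hypothetical four-row frame `demo` whose two known values come from the userId 5 rows shown above. As a side observation, the only `PostTypeId` 2 row in the posts preview also has a NaN `ViewCount`, which may hint that missingness depends on post type; that is a guess from a single row, not something the output confirms.

```python
import numpy as np
import pandas as pd

# Hypothetical toy subset: two posts with a ViewCount, two without.
demo = pd.DataFrame(
    {"postId": [6, 12, 32, 103], "ViewCount": [29229.0, np.nan, np.nan, 1990.0]}
)

# Option 1: treat missing as zero views (keeps all rows, lowers the mean).
filled = demo.assign(ViewCount=demo["ViewCount"].fillna(0))

# Option 2: drop rows without a ViewCount (fewer rows, mean over known values).
dropped = demo.dropna(subset=["ViewCount"])

print(len(filled), filled["ViewCount"].mean())
print(len(dropped), dropped["ViewCount"].mean())
```

With these four rows the mean is 7804.75 after filling and 15609.5 after dropping, so any later analysis of `ViewCount` depends on which option was taken.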

#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [56]:
print(merge.dtypes)

userId            int64
Reputation        int64
Views             int64
UpVotes           int64
DownVotes         int64
postId            int64
Score             int64
ViewCount       float64
CommentCount      int64
dtype: object


In [57]:
merge["ViewCount"] = merge["ViewCount"].astype("int64")
print(merge.dtypes)

userId          int64
Reputation      int64
Views           int64
UpVotes         int64
DownVotes       int64
postId          int64
Score           int64
ViewCount       int64
CommentCount    int64
dtype: object
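
The cast to `int64` works here only because the NaN values were filled first. As an alternative sketch, pandas also offers the nullable `Int64` dtype (capital I), which stores integers while keeping missing entries as `<NA>`. The series `views` below is a hypothetical toy example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with one missing value.
views = pd.Series([1278.0, np.nan, 3613.0], name="ViewCount")

# Plain int64 cannot hold NaN, so the gap must be filled before casting.
as_int = views.fillna(0).astype("int64")

# Nullable Int64 keeps the gap as <NA> while storing integers.
as_nullable = views.astype("Int64")

print(as_int.dtype, as_nullable.dtype)
print(as_nullable.isna().sum())
```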
