# Data Cleaning 

In [1]:
import pandas as pd

# Read the users dataset.

Check which separator `users.csv` uses before reading it.

In [2]:
users = pd.read_csv('../data/users.csv', sep = '#')
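
If you would rather not open the file by eye, the standard library can guess the separator from a few lines of text. This is only a sketch: the `sample` string below is a made-up stand-in for the first lines of the file, and the list of candidate delimiters is an assumption.

```python
import csv

# Hypothetical sample mimicking the first lines of a '#'-separated file.
sample = (
    "Id#Reputation#DisplayName\n"
    "-1#1#Community\n"
    "2#101#Geoff Dalgas\n"
)

# Restrict the candidates so the sniffer only considers plausible separators.
dialect = csv.Sniffer().sniff(sample, delimiters=",;#\t|")
detected_sep = dialect.delimiter
print(repr(detected_sep))
```

The detected character can then be passed as `sep=` to `pd.read_csv()`.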

## Check its shape

Check how many rows and columns you're dealing with.
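
A minimal sketch of reading the shape, using a small made-up dataframe (`toy` is not part of the exercise data):

```python
import pandas as pd

# Toy stand-in for the users dataframe.
toy = pd.DataFrame({"Id": [-1, 2, 3], "Reputation": [1, 101, 101]})

# .shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```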

## Use `.head()` to see the first rows of your dataframe.

In [3]:
users.head()

Unnamed: 0,Id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,


## Get the data info. 

Which columns have a large number of missing values? How much memory does this dataframe occupy?

In [4]:
users.isna().sum().max()  # largest per-column count of missing values

32345
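
The cell above reports only the largest missing count. To answer both questions, one option is to list the counts per column and ask pandas for the deep memory footprint. The sketch below uses a made-up `toy` dataframe; the column values are placeholders.

```python
import pandas as pd

# Toy stand-in with gaps in two columns.
toy = pd.DataFrame({
    "Id": [-1, 2, 3, 4],
    "Age": [None, 37.0, 35.0, None],
    "ProfileImageUrl": [None, None, None, "http://example.invalid/a.jpg"],
})

# Missing values per column, worst offenders first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# deep=True also counts the Python strings stored in object columns.
total_bytes = toy.memory_usage(deep=True).sum()
print(total_bytes, "bytes")

# info() prints dtypes, non-null counts and memory in one summary.
toy.info(memory_usage="deep")
```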

## Rename Id column to user_id.

Remember to store your results back in the dataframe.

In [5]:
users = users.rename(columns={'Id':'user_id'})
users

Unnamed: 0,user_id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40498,6726,1,2011-10-09 13:16:20,AlexAtStack,2012-05-18 09:32:44,,,,0,0,0,203972,,
40499,53426,101,2014-08-05 07:54:54,John J. Camilleri,2014-08-05 08:54:37,http://johnjcamilleri.com,"Gothenburg, Sweden","<p>Accidental computational linguist, de facto...",1,2,0,34865,28.0,https://www.gravatar.com/avatar/5738c02070833b...
40500,21468,101,2013-03-02 07:50:03,Peter L.,2013-03-02 07:50:03,http://www.a1qa.com/,"Minsk, Belarus","<p>QA Manager with comprehensive, cold-blooded...",1,0,0,2211454,32.0,http://www.gravatar.com/avatar/cbd80a5b2a5257d...
40501,54132,1,2014-08-15 10:52:25,user54132,2014-08-15 10:52:25,,,,1,0,0,4894117,,


# Import the `posts.csv` dataset.

Note that this is a gzip-compressed CSV. To read it correctly, check the documentation (or `help`) of `pd.read_csv()` for the `compression` argument and work out which value of `compression=...` to pass.

In [6]:
posts = pd.read_csv('../data/posts.csv.gzip', compression='gzip')
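
As far as I know, pandas infers gzip compression from a `.gz` extension but not from `.gzip`, which is why the argument is passed explicitly here. The round trip below is a sketch on a made-up dataframe written to a temporary folder; the file name `toy.csv.gzip` is hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Id": [1, 2], "Score": [23, 22]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv.gzip")
    # Write a gzip-compressed CSV, then read it back the same way.
    toy.to_csv(path, index=False, compression="gzip")
    restored = pd.read_csv(path, compression="gzip")

print(restored.equals(toy))
```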

## Perform the same as above to understand a bit of your data (head, info, shape)

In [7]:
posts.head()

Unnamed: 0,Id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,OwnerUserId,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,1,1,15.0,2010-07-19 19:12:12,23,1278.0,<p>How should I elicit prior distributions fro...,8.0,2010-09-15 21:08:26,Eliciting priors from experts,...,5.0,1,14.0,,,,,,,
1,2,1,59.0,2010-07-19 19:12:57,22,8198.0,<p>In many different statistical methods there...,24.0,2012-11-12 09:21:54,What is normality?,...,7.0,1,8.0,88.0,2010-08-07 17:56:44,,,,,
2,3,1,5.0,2010-07-19 19:13:28,54,3613.0,<p>What are some valuable Statistical Analysis...,18.0,2013-05-27 14:48:36,What are some valuable Statistical Analysis op...,...,19.0,4,36.0,183.0,2011-02-12 05:50:03,2010-07-19 19:13:28,,,,
3,4,1,135.0,2010-07-19 19:13:31,13,5224.0,<p>I have two groups of data. Each with a dif...,23.0,2010-09-08 03:00:19,Assessing the significance of differences in d...,...,5.0,2,2.0,,,,,,,
4,5,2,,2010-07-19 19:14:43,81,,"<p>The R-project</p>\n\n<p><a href=""http://www...",23.0,2010-07-19 19:21:15,,...,,3,,23.0,2010-07-19 19:21:15,2010-07-19 19:14:43,3.0,,,


In [8]:
posts.info()

In [10]:
posts.shape

(91976, 21)

## Rename Id column to post_id and OwnerUserId to user_id.

Again, remember to check that your results are correctly stored inside the dataframe.

In [11]:
posts = posts.rename(columns = {'Id' : 'post_id', 'OwnerUserId' : 'user_id'})
posts.head()

Unnamed: 0,post_id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,user_id,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,1,1,15.0,2010-07-19 19:12:12,23,1278.0,<p>How should I elicit prior distributions fro...,8.0,2010-09-15 21:08:26,Eliciting priors from experts,...,5.0,1,14.0,,,,,,,
1,2,1,59.0,2010-07-19 19:12:57,22,8198.0,<p>In many different statistical methods there...,24.0,2012-11-12 09:21:54,What is normality?,...,7.0,1,8.0,88.0,2010-08-07 17:56:44,,,,,
2,3,1,5.0,2010-07-19 19:13:28,54,3613.0,<p>What are some valuable Statistical Analysis...,18.0,2013-05-27 14:48:36,What are some valuable Statistical Analysis op...,...,19.0,4,36.0,183.0,2011-02-12 05:50:03,2010-07-19 19:13:28,,,,
3,4,1,135.0,2010-07-19 19:13:31,13,5224.0,<p>I have two groups of data. Each with a dif...,23.0,2010-09-08 03:00:19,Assessing the significance of differences in d...,...,5.0,2,2.0,,,,,,,
4,5,2,,2010-07-19 19:14:43,81,,"<p>The R-project</p>\n\n<p><a href=""http://www...",23.0,2010-07-19 19:21:15,,...,,3,,23.0,2010-07-19 19:21:15,2010-07-19 19:14:43,3.0,,,
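
A quick way to confirm that a rename was stored is to inspect the column names afterwards. In this sketch (with a made-up `toy` dataframe), `rename` returns a new dataframe and leaves the original untouched unless you assign the result back.

```python
import pandas as pd

toy = pd.DataFrame({"Id": [1, 2], "OwnerUserId": [8.0, 24.0]})

# rename() returns a new dataframe; the original keeps its old names.
renamed = toy.rename(columns={"Id": "post_id", "OwnerUserId": "user_id"})

print(list(toy.columns))      # original column names
print(list(renamed.columns))  # new column names
```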


## Define new dataframes for users and posts with the following selected columns:

**users columns**: user_id, Reputation, Views, UpVotes, DownVotes  
**posts columns**: post_id, Score, user_id, ViewCount, CommentCount, Body

In [12]:
user_columns = ['user_id', 'Reputation', 'Views', 'UpVotes', 'DownVotes']
posts_columns = ['post_id', 'Score', 'user_id', 'ViewCount', 'CommentCount', 'Body']

users_n = users.loc[:,user_columns]

In [13]:
posts_n = posts.loc[:,posts_columns]

**Note:** Check the new posts dataframe's info. What is the most noticeable change? 

Explain, in terms of efficiency, why we selected only some of its columns.
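
One way to see the effect is to compare the memory footprint before and after selecting columns. The sketch below uses a made-up dataframe with one bulky text column; the sizes it prints depend on your pandas version and platform.

```python
import pandas as pd

# Toy frame: two small numeric columns and one bulky text column.
toy = pd.DataFrame({
    "post_id": range(1000),
    "Score": range(1000),
    "Title": ["a fairly long title " * 5] * 1000,
})

full_bytes = toy.memory_usage(deep=True).sum()
slim_bytes = toy.loc[:, ["post_id", "Score"]].memory_usage(deep=True).sum()

print(f"all columns: {full_bytes} bytes")
print(f"selected columns: {slim_bytes} bytes")
```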

In [14]:
users_n.info()

In [15]:
posts_n.info()

# Merge the new users and posts dataframes into a dataframe called `posts_from_users`

You will need to perform an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

Think carefully about which key(s) you should merge on.

In [16]:
posts_from_users = pd.merge(left = posts_n, right= users_n, on='user_id')
posts_from_users.shape

(90883, 10)
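
An inner merge keeps only the keys present in both dataframes, so some posts can disappear. The sketch below, on made-up data, contrasts the inner result with an outer merge that uses `indicator=True` to show which side each row came from.

```python
import pandas as pd

# Toy data: post 4 belongs to a user (77) missing from the users table.
posts_toy = pd.DataFrame({"post_id": [1, 2, 3, 4], "user_id": [8, 8, 24, 77]})
users_toy = pd.DataFrame({"user_id": [8, 24, 99], "Reputation": [6764, 101, 1]})

# Inner merge: only rows whose user_id exists on both sides survive.
inner = pd.merge(posts_toy, users_toy, on="user_id", how="inner")

# Outer merge with an indicator column to audit what the inner merge dropped.
outer = pd.merge(posts_toy, users_toy, on="user_id", how="outer", indicator=True)
unmatched_posts = (outer["_merge"] == "left_only").sum()

print(len(inner), "rows kept;", unmatched_posts, "post(s) without a matching user")
```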

## Check the number of duplicated rows.

Remember that you can sum a boolean mask to count how many times `True` appears in it. This works because Python treats `True` as `1` and `False` as `0`.

In [17]:
posts_from_users.duplicated(keep=False).sum()

598
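
The count depends on the `keep` argument. In this sketch on made-up rows, the default flags only the extra copies, while `keep=False` flags every row that has a twin.

```python
import pandas as pd

toy = pd.DataFrame({
    "post_id": [1, 1, 2, 3, 3, 3],
    "user_id": [8, 8, 8, 24, 24, 24],
})

# Default keep='first': the first occurrence is not flagged.
extra_copies = toy.duplicated().sum()

# keep=False: every member of a duplicated group is flagged.
all_copies = toy.duplicated(keep=False).sum()

print(extra_copies, all_copies)
```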

## Find those duplicate values and try to understand what happened.

*Hint:* You can pass `keep=False` to the `.duplicated()` method to flag every copy of a duplicated row, not just the repeats.

*Hint 2:* You can sort the values `by=['user_id', 'post_id']` to see them in order.


In [18]:
posts_from_users.loc[posts_from_users.duplicated(keep=False), :]

Unnamed: 0,post_id,Score,user_id,ViewCount,CommentCount,Body,Reputation,Views,UpVotes,DownVotes
7481,1289,7,760.0,1139.0,8,<p>I am having difficulties to select the righ...,168,13,13,0
7482,1289,7,760.0,1139.0,8,<p>I am having difficulties to select the righ...,168,13,13,0
7483,8625,6,760.0,1799.0,3,<p>I was fiddling with PCA and LDA methods and...,168,13,13,0
7484,8625,6,760.0,1799.0,3,<p>I was fiddling with PCA and LDA methods and...,168,13,13,0
7485,23987,0,760.0,62.0,3,<p>I was studying on a PAMI article and I have...,168,13,13,0
...,...,...,...,...,...,...,...,...,...,...
90365,113691,0,54911.0,36.0,11,<p>I extract data related to a movie by sentim...,1,1,0,0
90522,114222,3,48159.0,49.0,0,"<p>Dear statistics experts,</p>\n\n<p>I have t...",16,0,0,0
90523,114222,3,48159.0,49.0,0,"<p>Dear statistics experts,</p>\n\n<p>I have t...",16,0,0,0
90557,114349,1,47999.0,34.0,0,<p>I'm having some trouble understanding Leo B...,6,2,0,0
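
Following the second hint, sorting puts the copies next to each other, and a group size shows how many times each pair occurs. This is a sketch on made-up rows, not the exercise data.

```python
import pandas as pd

toy = pd.DataFrame({
    "post_id": [8625, 1289, 1289, 8625, 500],
    "user_id": [760, 760, 760, 760, 8],
})

# Keep every copy of the duplicated rows and sort so twins sit together.
dupes = toy.loc[toy.duplicated(keep=False)].sort_values(by=["user_id", "post_id"])

# Number of copies per (user_id, post_id) pair.
copies = dupes.groupby(["user_id", "post_id"]).size()
print(dupes)
print(copies)
```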


## Should you drop them? If you think it is reasonable to drop them, then drop them.

Think: How would you correct it in the first place? That is, what was wrong in the first place?

*Hint:* There's a pandas method to drop duplicates. If you wanted to do it by hand, you could select the indexes of the duplicated rows and `.drop()` them.

In [18]:
posts_from_users.drop_duplicates()

Unnamed: 0,post_id,Score,user_id,ViewCount,CommentCount,Body,Reputation,Views,UpVotes,DownVotes
0,1,23,8.0,1278.0,1,<p>How should I elicit prior distributions fro...,6764,1089,604,25
1,16,16,8.0,,3,<p>Two projects spring to mind:</p>\n\n<ol>\n<...,6764,1089,604,25
2,36,41,8.0,67396.0,7,"<p>There is an old saying: ""Correlation does n...",6764,1089,604,25
3,65,14,8.0,,3,<p>The first formula is the <em>population</em...,6764,1089,604,25
4,78,33,8.0,,4,<p>You tend to use the covariance matrix when ...,6764,1089,604,25
...,...,...,...,...,...,...,...,...,...,...
90878,115366,1,55742.0,17.0,0,<p>Does any standard statistical software like...,6,0,0,0
90879,115370,1,55744.0,13.0,2,<p>im analyzing an article for my studies with...,6,1,0,0
90880,115371,0,35801.0,19.0,0,<p>I am trying to estimate the school effects ...,1,1,0,0
90881,115375,0,49365.0,9.0,0,<p>Assume a classification problem where there...,1,0,0,0


In [19]:
posts_from_users.shape

(90883, 10)

The duplicates were dropped because they were repeated rows.
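
Note that `drop_duplicates()` returns a new dataframe, so the shape only changes if the result is assigned back. The sketch below shows that, plus one possible upstream fix. It assumes, purely for illustration, that the duplicates come from a user appearing twice in the users table; `posts_toy` and `users_toy` are made up.

```python
import pandas as pd

posts_toy = pd.DataFrame({"post_id": [1, 2], "user_id": [760, 760]})
# Hypothetical users table in which user 760 appears twice.
users_toy = pd.DataFrame({"user_id": [760, 760], "Reputation": [168, 168]})

# Each post matches both copies of the user, so every row is doubled.
merged = pd.merge(posts_toy, users_toy, on="user_id")

# Assign the result; calling drop_duplicates() alone changes nothing.
deduped = merged.drop_duplicates()

# Alternative: clean the lookup table first and let merge verify uniqueness.
users_unique = users_toy.drop_duplicates(subset="user_id")
clean = pd.merge(posts_toy, users_unique, on="user_id", validate="many_to_one")

print(len(merged), len(deduped), len(clean))
```

With `validate="many_to_one"`, pandas raises an error if the right-hand keys are not unique, which catches the problem before it reaches the merged dataframe.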

## How many missing values do you have in your merged dataframe? In which columns?

In [20]:
posts_from_users.isnull().sum()

post_id             0
Score               0
user_id             0
ViewCount       48545
CommentCount        0
Body              220
Reputation          0
Views               0
UpVotes             0
DownVotes           0
dtype: int64

In [21]:
posts_from_users.head()

Unnamed: 0,post_id,Score,user_id,ViewCount,CommentCount,Body,Reputation,Views,UpVotes,DownVotes
0,1,23,8.0,1278.0,1,<p>How should I elicit prior distributions fro...,6764,1089,604,25
1,16,16,8.0,,3,<p>Two projects spring to mind:</p>\n\n<ol>\n<...,6764,1089,604,25
2,36,41,8.0,67396.0,7,"<p>There is an old saying: ""Correlation does n...",6764,1089,604,25
3,65,14,8.0,,3,<p>The first formula is the <em>population</em...,6764,1089,604,25
4,78,33,8.0,,4,<p>You tend to use the covariance matrix when ...,6764,1089,604,25


## Select only the rows that have at least one missing value.

In [22]:
mask = posts_from_users.isnull().mean()
mask

post_id         0.000000
Score           0.000000
user_id         0.000000
ViewCount       0.534148
CommentCount    0.000000
Body            0.002421
Reputation      0.000000
Views           0.000000
UpVotes         0.000000
DownVotes       0.000000
dtype: float64

In [23]:
posts_from_users[['ViewCount', 'Body']]

Unnamed: 0,ViewCount,Body
0,1278.0,<p>How should I elicit prior distributions fro...
1,,<p>Two projects spring to mind:</p>\n\n<ol>\n<...
2,67396.0,"<p>There is an old saying: ""Correlation does n..."
3,,<p>The first formula is the <em>population</em...
4,,<p>You tend to use the covariance matrix when ...
...,...,...
90878,17.0,<p>Does any standard statistical software like...
90879,13.0,<p>im analyzing an article for my studies with...
90880,19.0,<p>I am trying to estimate the school effects ...
90881,9.0,<p>Assume a classification problem where there...
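
The cells above look at missing values per column. To select rows instead, one option is a row-wise mask with `any(axis=1)`. The sketch uses a made-up `toy` dataframe.

```python
import pandas as pd

toy = pd.DataFrame({
    "post_id": [1, 16, 36],
    "ViewCount": [1278.0, None, 67396.0],
    "Body": ["<p>a</p>", "<p>b</p>", None],
})

# True for every row that has a missing value in at least one column.
row_mask = toy.isnull().any(axis=1)
rows_with_missing = toy.loc[row_mask]
print(rows_with_missing)
```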


## You will need to do something about the missing values. Will you drop them or fill them?

Pay attention: values can be missing for different reasons. Look at the `user_id` of some of these rows and at the body of the message. For which ones are you sure what the value should be, and which ones can you only infer? Don't rush; take a good look at your data.

In [24]:
posts_from_users['ViewCount'].fillna(0)

0         1278.0
1            0.0
2        67396.0
3            0.0
4            0.0
          ...   
90878       17.0
90879       13.0
90880       19.0
90881        9.0
90882        5.0
Name: ViewCount, Length: 90883, dtype: float64
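
Like most pandas methods, `fillna()` returns a new object, so the fill only persists if it is assigned back to the column. The sketch below uses made-up data; filling with `0` and with an empty string are assumptions about what a missing value means, not the only reasonable choice.

```python
import pandas as pd

toy = pd.DataFrame({
    "post_id": [1, 16],
    "ViewCount": [1278.0, None],
    "Body": ["<p>a</p>", None],
})

# Assign the filled column back so the change is stored.
toy["ViewCount"] = toy["ViewCount"].fillna(0)
toy["Body"] = toy["Body"].fillna("")

print(toy.isnull().sum())
```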

## Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [25]:
posts_from_users.loc[: , :]

Unnamed: 0,post_id,Score,user_id,ViewCount,CommentCount,Body,Reputation,Views,UpVotes,DownVotes
0,1,23,8.0,1278.0,1,<p>How should I elicit prior distributions fro...,6764,1089,604,25
1,16,16,8.0,,3,<p>Two projects spring to mind:</p>\n\n<ol>\n<...,6764,1089,604,25
2,36,41,8.0,67396.0,7,"<p>There is an old saying: ""Correlation does n...",6764,1089,604,25
3,65,14,8.0,,3,<p>The first formula is the <em>population</em...,6764,1089,604,25
4,78,33,8.0,,4,<p>You tend to use the covariance matrix when ...,6764,1089,604,25
...,...,...,...,...,...,...,...,...,...,...
90878,115366,1,55742.0,17.0,0,<p>Does any standard statistical software like...,6,0,0,0
90879,115370,1,55744.0,13.0,2,<p>im analyzing an article for my studies with...,6,1,0,0
90880,115371,0,35801.0,19.0,0,<p>I am trying to estimate the school effects ...,1,1,0,0
90881,115375,0,49365.0,9.0,0,<p>Assume a classification problem where there...,1,0,0,0
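
In the table above, `user_id` and `ViewCount` are shown as floats although they hold whole numbers. A possible conversion is sketched below on made-up data: a plain `int64` works when nothing is missing, while the nullable `Int64` dtype keeps the gaps.

```python
import pandas as pd

toy = pd.DataFrame({"user_id": [8.0, 24.0], "ViewCount": [1278.0, None]})

# No missing values here, so a plain integer dtype is safe.
toy["user_id"] = toy["user_id"].astype("int64")

# Nullable integer dtype: missing entries become <NA> instead of raising.
toy["ViewCount"] = toy["ViewCount"].astype("Int64")

print(toy.dtypes)
```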


# Bonus 1: (filtering) What is the average number of comments for users who are above the average reputation?

*Hint:* Calculate the average user Reputation, store it in a variable called `avg_reputation`, and then use that variable to filter the dataset and generate the results for each case (`Reputation > {avg_reputation}`, and so on).

*Hint 2:* You could create a variable based on that condition and use the group by function to perform the task above.

In [45]:
avg_reputation = posts_from_users['Reputation'].mean()
print(avg_reputation)
mask = (posts_from_users['Reputation'] > avg_reputation)
mask_comment = posts_from_users.loc[mask, :]
mask_comment.groupby('user_id').mean(numeric_only=True)

6263.007812242114


Unnamed: 0_level_0,post_id,Score,ViewCount,CommentCount,Reputation,Views,UpVotes,DownVotes
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
5.0,1988.692308,12.504274,8898.294118,1.581197,6792.0,1145.0,662.0,5.0
8.0,4403.53719,9.0,7441.5,2.115702,6764.0,1089.0,604.0,25.0
88.0,15152.342183,5.020649,1578.090909,1.737463,14082.0,3320.0,4235.0,126.0
159.0,32971.924915,7.150171,11725.875,2.03413,18283.0,3781.0,1014.0,59.0
183.0,24253.166329,5.8357,2597.88,1.344828,22625.0,4069.0,2496.0,45.0
251.0,2121.181818,7.530303,753.0,1.590909,7931.0,984.0,819.0,0.0
401.0,23114.122093,4.668605,496.0625,1.965116,6906.0,1070.0,394.0,6.0
442.0,33775.541985,6.435115,4427.529412,2.206107,6431.0,973.0,857.0,21.0
449.0,9676.019231,5.269231,,1.665385,12813.0,1089.0,1607.0,57.0
601.0,40879.721053,3.252632,599.25,1.971053,13478.0,1286.0,283.0,17.0
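
To get a single answer rather than one row per user, the two hints can be sketched as follows. The data is made up; the idea is to filter on the condition, or to group by the condition itself to compare both cases.

```python
import pandas as pd

toy = pd.DataFrame({
    "Reputation": [10, 10, 500, 500],
    "CommentCount": [0, 2, 3, 5],
})

avg_reputation = toy["Reputation"].mean()
above_avg = toy["Reputation"] > avg_reputation

# Hint 1: filter, then average the comments of the remaining rows.
mean_comments_above = toy.loc[above_avg, "CommentCount"].mean()

# Hint 2: group by the boolean condition to see both cases at once.
by_flag = toy.groupby(above_avg)["CommentCount"].mean()

print(mean_comments_above)
print(by_flag)
```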


# Bonus 2: (grouping) Group your dataframe by the Reputation of your user. Calculate the mean value of ViewCount and CommentCount for each reputation value.

Suppose the missing values in ViewCount are due to a systemic error: the system abended, and you want to guess what values should have been there in the first place.

Would the group mean be a good candidate for imputing the missing `ViewCount` values? If so, impute them with these values.

In [51]:
mask_comment.groupby(['user_id', 'Reputation'])[['ViewCount', 'CommentCount']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,ViewCount,CommentCount
user_id,Reputation,Unnamed: 2_level_1,Unnamed: 3_level_1
5.0,6792,8898.294118,1.581197
8.0,6764,7441.5,2.115702
88.0,14082,1578.090909,1.737463
159.0,18283,11725.875,2.03413
183.0,22625,2597.88,1.344828
251.0,7931,753.0,1.590909
401.0,6906,496.0625,1.965116
442.0,6431,4427.529412,2.206107
449.0,12813,,1.665385
601.0,13478,599.25,1.971053
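
One way to impute with group means is `groupby(...).transform("mean")`, which returns a value for every row so it can be passed straight to `fillna()`. This is a sketch on made-up data, grouping by `Reputation` as the exercise asks; the `ViewCount_imputed` column name is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Reputation": [101, 101, 101, 6764, 6764],
    "ViewCount": [10.0, None, 30.0, 1278.0, None],
})

# Mean ViewCount of each reputation group, aligned row by row.
group_mean = toy.groupby("Reputation")["ViewCount"].transform("mean")

# Fill only the gaps; observed values are left as they are.
toy["ViewCount_imputed"] = toy["ViewCount"].fillna(group_mean)
print(toy)
```

A group in which every `ViewCount` is missing has no mean to borrow, so its rows stay missing, as the `NaN` mean for user 449 in the output above suggests.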


## References

Sample database used: https://relational.fit.cvut.cz/dataset/Stats

Stack-overflow database: https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/
