#### 1. Import pandas library

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as you learned in the lesson on importing/exporting data


In [2]:
from sqlalchemy import create_engine
import pymysql as mdb

#### 3. Create a MySQL engine to connect to the server. Check the connection details at [this link](https://relational.fit.cvut.cz/search?tableCount%5B%5D=0-10&tableCount%5B%5D=10-30&dataType%5B%5D=Numeric&databaseSize%5B%5D=KB&databaseSize%5B%5D=MB)

In [3]:

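# use the pymysql driver imported above (mysql+mysqlconnector needs the separate mysql-connector-python package)
motor=create_engine('mysql+pymysql://guest:relational@relational.fit.cvut.cz:3306/stats')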

#### 4. Import the users table 

In [4]:
users_table=pd.read_sql('SELECT Id FROM users', motor)
display(users_table.head())

Unnamed: 0,Id
0,-1
1,2
2,3
3,4
4,5


#### 5. Rename Id column to userId

In [5]:
users_table=users_table.rename(columns={'Id': 'userId'})
display(users_table.head())

Unnamed: 0,userId
0,-1
1,2
2,3
3,4
4,5


#### 6. Import the posts table. 

In [6]:
posts_table=pd.read_sql('SELECT Id,OwnerUserId FROM posts', motor)
display(posts_table.head())

Unnamed: 0,Id,OwnerUserId
0,17,
1,28,
2,56,
3,101,
4,152,


#### 7. Rename Id column to postId and OwnerUserId to userId

In [7]:
posts_table=posts_table.rename(columns={'Id': 'postId', 'OwnerUserId': 'userId'})
display(posts_table.head())

Unnamed: 0,postId,userId
0,17,
1,28,
2,56,
3,101,
4,152,
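
Steps 4 to 7 read each table and then rename its columns in pandas. As an alternative, the rename can be done in the query itself with SQL column aliases. The sketch below is only an illustration: it uses an in-memory SQLite database as a hypothetical stand-in for the remote `posts` table, and the rows and the name `posts_aliased` are made up for the example.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory stand-in for the remote `posts` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (Id INTEGER, OwnerUserId INTEGER)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [(1, 8), (2, 24), (3, None)])

# `AS` renames each column while it is being read, so no rename() call is needed.
posts_aliased = pd.read_sql(
    "SELECT Id AS postId, OwnerUserId AS userId FROM posts", conn
)
conn.close()

print(posts_aliased)
```

The same `SELECT ... AS ...` query should work against the `motor` engine, since column aliases are standard SQL.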


#### 8. Define new dataframes for users and posts with the following selected columns:
- **users columns**: userId, Reputation, Views, UpVotes, DownVotes
- **posts columns**: postId, Score, userId, ViewCount, CommentCount

In [8]:
users=pd.read_sql('SELECT Id,Reputation,Views,UpVotes,DownVotes FROM users', motor)
posts=pd.read_sql('SELECT Id,Score,ViewCount,CommentCount,OwnerUserId FROM posts', motor)

users=users.rename(columns={'Id': 'userId'})
display(users.head())
posts=posts.rename(columns={'Id': 'postId', 'OwnerUserId': 'userId'})
display(posts.head())

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5


Unnamed: 0,postId,Score,ViewCount,CommentCount,userId
0,1,23,1278.0,1,8.0
1,2,22,8198.0,1,24.0
2,3,54,3613.0,4,18.0
3,4,13,5224.0,2,23.0
4,5,81,,3,23.0


#### 8. Merge both dataframes, users and posts. 
Use [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) to combine the posts and users dataframes.

In [9]:
datos=pd.merge(users, posts, on='userId')
datos.head(10)

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,postId,Score,ViewCount,CommentCount
0,-1,1,0,5007,1920,2175,0,,0
1,-1,1,0,5007,1920,8576,0,,0
2,-1,1,0,5007,1920,8578,0,,0
3,-1,1,0,5007,1920,8981,0,,0
4,-1,1,0,5007,1920,8982,0,,0
5,-1,1,0,5007,1920,9857,0,,0
6,-1,1,0,5007,1920,9858,0,,0
7,-1,1,0,5007,1920,9860,0,,0
8,-1,1,0,5007,1920,10130,0,,0
9,-1,1,0,5007,1920,10131,0,,0
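
`pd.merge` performs an inner join by default, so rows whose `userId` appears in only one of the two dataframes are silently dropped. The sketch below uses small made-up frames (`users_demo` and `posts_demo` are hypothetical names, not part of the exercise) to show how `how='outer'` together with `indicator=True` reveals which rows have no match.

```python
import pandas as pd

# Made-up data: users 2 and 3 have no posts, and post 102 has an unknown owner.
users_demo = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [10, 20, 30]})
posts_demo = pd.DataFrame({"postId": [100, 101, 102], "userId": [1, 1, 99]})

# Default inner join: only keys present in both frames survive.
inner = pd.merge(users_demo, posts_demo, on="userId")

# Outer join with an indicator column that labels the origin of each row.
outer = pd.merge(users_demo, posts_demo, on="userId", how="outer", indicator=True)

print(len(inner))
print(outer["_merge"].value_counts())
```

Comparing the row counts of the inner and outer results is a quick way to check how much data the default join discards.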


#### 9. How many missing values do you have in your merged dataframe? On which columns?

In [10]:
null=datos.isna().sum()     # datos.isnull().sum() works too
null[null>0]

ViewCount    48396
dtype: int64
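
The raw count is easier to judge next to the share of rows it represents: 48396 missing values out of the 90584 merged rows is roughly 53%. A minimal sketch of such a summary, on a made-up frame (`demo` and `summary` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up data with two missing view counts out of four rows.
demo = pd.DataFrame(
    {"Score": [1, 2, 3, 4], "ViewCount": [10.0, np.nan, np.nan, 40.0]}
)

# isna().sum() counts missing values; isna().mean() gives the fraction per column.
summary = pd.DataFrame({"missing": demo.isna().sum(), "share": demo.isna().mean()})

# Keep only the columns that actually have missing values.
summary = summary[summary["missing"] > 0]
print(summary)
```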

#### 10. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before moving on to the next step.

In [11]:
# Fill with zero: the view count is meaningful (for monetization, for example), and I do not want to drop
# half of the data, since those rows may also carry meaningful values such as Score or Reputation.

datos['ViewCount']=datos['ViewCount'].fillna(0)
datos.describe()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,postId,Score,ViewCount,CommentCount
count,90584.0,90584.0,90584.0,90584.0,90584.0,90584.0,90584.0,90584.0,90584.0
mean,16546.764727,6282.395412,1034.245176,734.315718,33.273249,56539.080522,2.780767,259.2534,1.89465
std,15273.367108,15102.26867,2880.074012,2050.869327,134.936435,33840.307529,4.948922,1632.261405,2.638704
min,-1.0,1.0,0.0,0.0,0.0,1.0,-19.0,0.0,0.0
25%,3437.0,60.0,5.0,1.0,0.0,26051.75,1.0,0.0,0.0
50%,11032.0,396.0,45.0,22.0,0.0,57225.5,2.0,0.0,1.0
75%,27700.0,4460.0,514.25,283.0,8.0,86145.25,3.0,111.0,3.0
max,55746.0,87393.0,20932.0,11442.0,1920.0,115378.0,192.0,175495.0,45.0
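
Filling with zero keeps every row, but it also changes the summary statistics: in the table above, the 25% and 50% quantiles of `ViewCount` are both 0 because more than half of the values were filled. The sketch below illustrates the effect on a made-up series; the `was_missing` flag is an assumption about how one might keep track of the filled rows, not something the exercise asks for.

```python
import numpy as np
import pandas as pd

# Made-up view counts, half of them missing.
views = pd.Series([120.0, np.nan, 80.0, np.nan, np.nan, 400.0], name="ViewCount")

# Remember which rows were missing before filling them.
was_missing = views.isna()
filled = views.fillna(0)

# Statistics on the original series skip NaN; the filled series counts the zeros.
print(views.median(), views.mean())
print(filled.median(), filled.mean())
```

If the zeros would distort a later analysis, the flag makes it possible to exclude the filled rows again without losing the other columns.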


#### 11. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [12]:

datos['ViewCount']=datos['ViewCount'].astype('int64')
datos.dtypes
# a view count is always a whole number, so the float dtype I was getting makes no sense

userId          int64
Reputation      int64
Views           int64
UpVotes         int64
DownVotes       int64
postId          int64
Score           int64
ViewCount       int64
CommentCount    int64
dtype: object
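
The cast to `int64` only works here because the missing values were filled first; a NumPy integer column cannot hold NaN. If the gaps should stay visible, pandas also offers the nullable `Int64` dtype (capital I). A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up view counts with one missing value.
views = pd.Series([1278.0, 8198.0, np.nan], name="ViewCount")

# Approach used above: fill first, then cast to a NumPy integer dtype.
as_int = views.fillna(0).astype("int64")

# Alternative: the nullable integer dtype keeps the missing value as <NA>.
as_nullable = views.astype("Int64")

print(as_int.dtype, as_nullable.dtype)
print(as_nullable.isna().sum())
```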

In [31]:
# Bonus
stats=datos.describe().transpose()
stats['IQR']=stats['75%']-stats['25%']
#display(stats)

frames=[]   # one dataframe of outlier rows per column

for c in stats.index:
    # bounds: 1.5 * IQR below the first quartile and above the third quartile
    corte=stats.at[c,'IQR']*1.5
    low=stats.at[c,'25%']-corte
    up=stats.at[c,'75%']+corte
    res=datos[(datos[c]<low) | (datos[c]>up)].copy()
    res['Outlier']=c   # name of the column in which the row is extreme
    frames.append(res)

# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
outliers=pd.concat(frames)

display(outliers)
outliers.to_csv('outliers.csv')

Unnamed: 0,CommentCount,DownVotes,Outlier,Reputation,Score,UpVotes,ViewCount,Views,postId,userId
1155,0,126,Reputation,14082,25,4235,0,3320,74,88
1156,0,126,Reputation,14082,5,4235,0,3320,94,88
1157,1,126,Reputation,14082,7,4235,0,3320,99,88
1158,3,126,Reputation,14082,6,4235,0,3320,119,88
1159,0,126,Reputation,14082,7,4235,0,3320,140,88
1160,2,126,Reputation,14082,5,4235,0,3320,143,88
1161,1,126,Reputation,14082,8,4235,0,3320,255,88
1162,0,126,Reputation,14082,14,4235,0,3320,265,88
1163,0,126,Reputation,14082,5,4235,0,3320,275,88
1164,1,126,Reputation,14082,2,4235,0,3320,309,88


#### Bonus: Identify extreme values in your merged dataframe as you have learned in class. Create a dataframe called outliers with the same columns as our data set and calculate the bounds. The outliers dataframe will hold the rows of merged_df whose values fall outside those bounds. Save your outliers dataframe to a csv file in the your-code folder.
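
The bounds can be checked by hand on a small example. The sketch below applies the 1.5 * IQR rule to a made-up series; `iqr_bounds` is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds located k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    spread = k * (q3 - q1)
    return q1 - spread, q3 + spread


# Made-up scores with one obviously extreme value.
scores = pd.Series([1, 2, 2, 3, 3, 3, 4, 50], name="Score")

low, up = iqr_bounds(scores)

# Values strictly outside the bounds are treated as outliers.
extreme = scores[(scores < low) | (scores > up)]
print(low, up)
print(extreme.tolist())
```

Here the quartiles are 2.0 and 3.25, so the bounds are 0.125 and 5.125 and only the value 50 falls outside them.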