### **Joining DataFrames with merge**


Each of our datasets is already saved in its own .csv file. Each one contains information that the others lack, so we need a way to link them so that one dataset can fill in what another is missing.

Let's read two of our datasets:


In [None]:
import pandas as pd

In [None]:
users = pd.read_csv('../../Datasets/MovieLens/users-raw.csv', index_col=0, names=['gender', 'age', 'occupation', 'cp'])


In [None]:
users.index.name = 'user_id'

In [None]:
occupations = pd.read_csv('../../Datasets/MovieLens/occupations-raw.csv', index_col=0, names=['description'])


In [None]:
occupations.index.name = 'occupation_id'


In [None]:
users.head()

In [None]:
occupations.head()

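users has a column named occupation whose codes correspond to the index of the occupations table, where each code is mapped to a text description of the occupation. We can use merge to attach that description to each user: left_on names the column to match in the left table, and right_index=True tells pandas to match it against the index of the right table.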


In [None]:
users_full = pd.merge(users, occupations, left_on='occupation', right_index=True).sort_index()


In [None]:
users_full


Now we can rename the columns to make them more descriptive:


In [None]:
users_full = users_full.rename(columns={'occupation': 'occupation_id', 'description': 'occupation'})


In [None]:
users_full
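To see the merge in isolation, here is a minimal sketch on toy data. The tables `toy_users` and `toy_occupations` and their values are hypothetical; they only mirror the structure of the real files.

```python
import pandas as pd

# Hypothetical toy data that mirrors the structure of users and occupations.
toy_users = pd.DataFrame(
    {"gender": ["F", "M", "M"], "occupation": [2, 0, 2]},
    index=pd.Index([1, 2, 3], name="user_id"),
)
toy_occupations = pd.DataFrame(
    {"description": ["artist", "doctor", "engineer"]},
    index=pd.Index([0, 1, 2], name="occupation_id"),
)

# Match the 'occupation' column on the left against the index on the right.
toy_full = pd.merge(
    toy_users, toy_occupations, left_on="occupation", right_index=True
).sort_index()

# Rename so the code and its description are easy to tell apart.
toy_full = toy_full.rename(
    columns={"occupation": "occupation_id", "description": "occupation"}
)
print(toy_full)
```

Users 1 and 3 share code 2, so both rows receive the same description. The default merge is an inner join, so a user whose code is missing from the occupations table would be dropped from the result.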


### **Grouping data with groupby**



In [None]:
users = pd.read_csv('../../Datasets/MovieLens/users-full.csv', index_col=0)

users

In [None]:
users.groupby('gender')


In [None]:
users.groupby('gender').size()


We can also request specific columns from our groups and apply aggregations to each column:


In [None]:
users.groupby('gender')['occupation'].value_counts()


We can use two or more columns to group as well. What happens is that the dataset is grouped using the first column, and then, within each group, a second grouping is done using the second column:


In [None]:
users_ga_counts = users.groupby(['gender', 'age_range'])['occupation'].value_counts()
users_ga_counts


In [None]:
users_ga_counts.loc['F']


In [None]:
users_ga_counts.loc[('F', '18-24')]
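The selections above work because grouping by several columns produces a multi-level index. A small sketch on hypothetical toy data (same column names, made-up rows) shows how each level can be selected:

```python
import pandas as pd

# Hypothetical toy data with the same column names used above.
toy = pd.DataFrame(
    {
        "gender": ["F", "F", "F", "M", "M"],
        "age_range": ["18-24", "18-24", "25-34", "18-24", "25-34"],
        "occupation": ["artist", "artist", "doctor", "artist", "doctor"],
    }
)

counts = toy.groupby(["gender", "age_range"])["occupation"].value_counts()

# The result is a Series indexed by three levels:
# gender, then age_range, then the occupation being counted.
print(counts.index.nlevels)

# Selecting a value of the outer level leaves the two inner levels.
print(counts.loc["F"])

# A tuple selects several levels at once, leaving only the occupation level.
print(counts.loc[("F", "18-24")])
```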


Not all functions can be applied to groupby objects "out of the box". To apply the ones we cannot call directly, we use the agg method. agg receives a function or a list of functions and applies them to the requested columns in each group.


In [None]:
users.groupby('gender')['occupation'].agg(pd.Series.mode)


We can apply the function to two columns at the same time:


In [None]:
users.groupby('gender')[['age_range', 'occupation']].agg(pd.Series.mode)


We can also apply multiple functions at the same time by passing a list of functions to agg. In this case we are going to compute some statistics on the age_id column. These statistics will not be meaningful, because the column contains ids that represent age ranges, not actual ages. Consider it a simple example of how the tools work:


In [None]:
users.groupby('gender')['age_id'].agg(['mean', 'median', 'std'])
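As an illustration of what agg returns, here is a sketch on a hypothetical toy table. The second call uses pandas' keyword form of agg, which is not used elsewhere in this notebook, to choose the output column names.

```python
import pandas as pd

# Hypothetical toy data; age_id values are made up.
toy = pd.DataFrame(
    {"gender": ["F", "F", "M", "M", "M"], "age_id": [1, 3, 2, 2, 5]}
)

# A list of functions gives one row per group and one column per function.
summary = toy.groupby("gender")["age_id"].agg(["mean", "median", "std"])
print(summary)

# Keyword arguments (label=function) let us pick the column labels.
labelled = toy.groupby("gender")["age_id"].agg(avg="mean", spread="std")
print(labelled)
```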


### **Location estimators**

Location estimators help us determine which single value best describes a dataset. We call this value the "typical value" of the set. The two most commonly used estimates are:

- Average (or mean)
- Median

Let's see how they are calculated using pandas.

In [None]:
df = pd.read_csv('../../Datasets/melbourne_housing-clean.csv', index_col=0)


In [None]:
df.head()


**MEAN**

The mean or average is obtained by adding all the values of a numerical dataset and dividing the sum by the number of values in the set.


Let's analyze the price column. Let's see what is the "typical value" obtained using the mean (average):


In [None]:
df['price'].mean()
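The calculation itself is short enough to write by hand. The prices below are hypothetical and chosen so that one of them is an outlier:

```python
# Hypothetical prices; the last one is an outlier.
prices = [300_000, 320_000, 350_000, 380_000, 2_500_000]

# Sum of the values divided by how many there are.
mean_price = sum(prices) / len(prices)
print(mean_price)

# Leaving out the outlier changes the mean considerably.
mean_without_outlier = sum(prices[:-1]) / len(prices[:-1])
print(mean_without_outlier)
```

A single extreme value more than doubles the result here, which is why the mean is considered sensitive to outliers.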


**MEDIAN**

The median is obtained as follows:

- First, we sort our data in ascending order.
- We then take the value that is right in the middle of the sorted sequence.
- If our set has an even number of values, and therefore has no value exactly in the middle, we take the average of the two middle values.

Now let's look at the "typical value" obtained using the median:


In [None]:
df['price'].median()
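The steps above can be sketched in plain Python. The function name `median` and the sample lists are illustrative only:

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single value sits exactly in the middle.
        return ordered[middle]
    # Even count: average the two values around the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd = [7, 1, 5]       # sorted: 1, 5, 7
even = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7
print(median(odd))
print(median(even))
```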


**Truncated Mean**


The truncated mean (also called the trimmed mean) is a more robust estimate of location than the mean: it is less sensitive to outliers because the most extreme values are discarded before averaging. It is obtained as follows:

- First, we sort our set in ascending order.
- We then decide what percentage of our data we are going to truncate. Common choices range from 5% to 25%.
- We divide that percentage by two and remove that fraction of the data from the start and from the end of the sequence. For example, if we decide to truncate 5%, we remove 2.5% of the data from the start and 2.5% from the end.
- Finally, we take the average of the remaining values.


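Fortunately, we don't have to do this manually. The scipy library already offers a function to get the truncated mean easily. Note that the second argument of trim_mean is the fraction removed from each end, so the value 0.1 used below removes 10% of the data from the start and 10% from the end (20% in total):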


In [None]:
from scipy import stats


In [None]:
df.head()


In [None]:
stats.trim_mean(df['price'], 0.1)
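For reference, a hand-written version of the procedure might look like the sketch below. The helper `trimmed_mean` is hypothetical; it takes the fraction to drop from each end, which is the same convention scipy's trim_mean uses.

```python
def trimmed_mean(values, proportion_per_tail):
    """Mean after dropping a fraction of the sorted values from each end."""
    ordered = sorted(values)
    # Number of values removed from each end, rounded down.
    k = int(proportion_per_tail * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

plain_mean = sum(data) / len(data)
robust_mean = trimmed_mean(data, 0.1)  # drops 1 and 1000 before averaging

print(plain_mean)
print(robust_mean)
```

With ten values and 0.1 per tail, exactly one value is removed from each end, so the outlier no longer affects the result.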


**Standard deviation**


The standard deviation measures how spread out our data is around the average. To obtain it, the following steps are carried out:

- First, we obtain the average of our data.
- We then compute the difference between each value in our set and the average.
- Then we square each of these differences.
- Then all the squared differences are added up.
- The sum is then divided by the number of values minus 1.
- Finally, the square root of the resulting value is taken.


In [None]:
df['price'].std()
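The steps above can be sketched in plain Python on a small hypothetical list. Dividing by the number of values minus 1 matches the default behaviour of the pandas std method.

```python
import math

# Hypothetical sample values.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: the average.
mean = sum(values) / len(values)

# Steps 2 and 3: difference from the average, squared.
squared_diffs = [(v - mean) ** 2 for v in values]

# Steps 4 and 5: add them up and divide by (number of values - 1).
variance = sum(squared_diffs) / (len(values) - 1)

# Step 6: square root.
std = math.sqrt(variance)
print(std)
```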


The higher the result, the more dispersed our data is (that is, many values lie far from our typical value); the lower the result, the less spread out our data is (that is, the values lie closer to our typical value).

Of course, we have to take the range of our values into account to decide whether a standard deviation is small or large. For example, a standard deviation of 10 is very small if our values have a range of 1,000,000. By contrast, a standard deviation of 10 is much larger if our values have a range of 40.


**CORRELATION**

We say that two variables are positively correlated if an increase in the values of one is associated with an increase in the values of the other, and a decrease in one is associated with a decrease in the other.

Conversely, we say that they are negatively correlated if an increase in the values of one is associated with a decrease in the values of the other, and vice versa.


In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sns.set_style('white')

In [None]:
arr_1_1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
arr_1_2 = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

plt.scatter(arr_1_1, arr_1_2, c='m');
plt.plot(arr_1_1, arr_1_2, c='c');

In [None]:
arr_2_1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
arr_2_2 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

plt.scatter(arr_2_1, arr_2_2, c='m');
plt.plot(arr_2_1, arr_2_2, c='c');

In [None]:
arr_3_1 = pd.Series([1, 7, 1, 22, 54, 2, 7, 26, 3, 13, 37, 87, 63, 15, 16, 74, 56, 95, 78, 61, 12, 43, 63, 84])
arr_3_2 = pd.Series([64, 43, 12, 4, 75, 46, 94, 46, 24, 5, 85, 67, 98, 15, 12, 53, 3, 85, 36, 24, 74, 57, 64, 13])

plt.scatter(arr_3_1, arr_3_2, c='m');

In [None]:
print(f'Correlation between the first two Series: {arr_1_1.corr(arr_1_2)}')


In [None]:
print(f'Correlation between the second two Series: {arr_2_1.corr(arr_2_2)}')


In [None]:
print(f'Correlation between the third two Series: {arr_3_1.corr(arr_3_2)}')
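The corr method computes the Pearson correlation coefficient by default. As a rough sketch of what that number is, the hypothetical helper `pearson` below computes it by hand for two perfectly related sequences:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)


up = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
down = up[::-1]

r_negative = pearson(up, down)  # one rises exactly as the other falls
r_positive = pearson(up, up)    # both rise together
print(r_negative, r_positive)
```

The result always lies between -1 and 1: values near -1 indicate a strong negative correlation, values near 1 a strong positive one, and values near 0 little or no linear relationship.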


**Correlation matrix and heat maps**


In [None]:
df = pd.read_csv('../../Datasets/diabetes-clean.csv', index_col=0)


In [None]:
df.head()


In [None]:
df_filtered = df.drop(columns=['outcome'])


In [None]:
df_filtered.corr()


Let's now use a heatmap to visualize this matrix in a way that is easier to interpret:


In [None]:
plt.figure(figsize=(8, 6))
ax = sns.heatmap(df_filtered.corr(), vmin=-1, vmax=1, annot=True, cmap="YlGnBu", linewidths=.5);
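To get used to reading a correlation matrix, it can help to build one on a tiny table where the relationships are known in advance. The table `toy` below is hypothetical:

```python
import pandas as pd

# Hypothetical data: b is exactly twice a, c tends to fall as a rises.
toy = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 3, 4, 1, 2]}
)

matrix = toy.corr()
print(matrix)

# Each cell holds the correlation between its row variable and its
# column variable, so the matrix is symmetric and its diagonal is 1.
print(matrix.loc["a", "b"], matrix.loc["a", "c"])
```

Because the matrix is symmetric, the cells above the diagonal repeat the ones below it, so only one half of a heatmap needs to be read.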
