# Exercise - 05/05/2002

Consider the dataset below, which contains the **heights** (in inches) and **weights** (in pounds) of fictitious people. <br/>
https://www.kaggle.com/mustafaali96/weight-height

Perform the following tasks:
- Split the dataset into 80% for training and 20% for testing
- Compute the correlation between the training set variables
- Train a regression model considering **height** as the independent variable and **weight** as the dependent one
   + Compute the model determination coefficient
   + Plot a scatterplot of the two variables containing the regression model (line)
- Predict the test set
   + Plot a scatterplot of the two variables containing the regression model (line)
   + Compute error metrics for regression

***

In [1]:
#importing used modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [2]:
#loading the dataset
df = pd.read_csv('datasets/weight-height.csv')

In [3]:
#showing basic info about the dataset
df.head()

|   | Gender | Height | Weight |
|---|--------|--------|--------|
| 0 | Male | 73.847017 | 241.893563 |
| 1 | Male | 68.781904 | 162.310473 |
| 2 | Male | 74.110105 | 212.740856 |
| 3 | Male | 71.730978 | 220.04247 |
| 4 | Male | 69.881796 | 206.349801 |


In [4]:
df.shape

(10000, 3)

In [5]:
df['Gender'].drop_duplicates()

0         Male
5000    Female
Name: Gender, dtype: object

The dataset has 10,000 rows and three columns, and the `Gender` column takes only two values, `Male` and `Female`. We can also build one subset per gender so we can compare them later.

In [6]:
df_males = df[df['Gender'] == 'Male']

In [7]:
#Now this dataset only has the data of males
df_males['Gender'].drop_duplicates()

0    Male
Name: Gender, dtype: object

In [8]:
#the gender column is now unnecessary, so we can drop it
df_males = df_males.drop(columns='Gender')

In [9]:
df_males.shape

(5000, 2)

In [10]:
#repeating the steps for the females' dataset
df_females = df[df['Gender'] == 'Female']
df_females = df_females.drop(columns='Gender')
df_females.shape

(5000, 2)

We now have the datasets to work with. Let's start with the exercise proper.

***

1 - Split the dataset into 80% for training and 20% for testing

In [11]:
X = df[['Height']] #independent variable
y = df[['Weight']] #dependent variable

In [12]:
#splitting the set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 1215)

In [13]:
#sizes of the training and test sets
print(f'X_train size: {X_train.shape[0]}\nX_test size: {X_test.shape[0]}\ny_train size: {y_train.shape[0]}\ny_test size: {y_test.shape[0]}')

X_train size: 8000
X_test size: 2000
y_train size: 8000
y_test size: 2000


The set is now split 80/20: 8000 rows for training and 2000 for testing.

2 - Compute the correlation between the training set variables

In [14]:
corr_set = df['Height'].corr(df['Weight'])
corr_train = X_train['Height'].corr(y_train['Weight'])
print(f'The correlation between height and weight in the whole dataset is {corr_set:.4f}. For the training set it is {corr_train:.4f}.')

The correlation between height and weight in the whole dataset is 0.9248. For the training set it is 0.9237.
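
To make explicit what `.corr` computes here, the sketch below evaluates the Pearson coefficient from its definition and compares it with the pandas result. It runs on synthetic stand-in data (the `demo` frame and its generating parameters are assumptions for illustration, not the Kaggle file).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, not the Kaggle dataset):
# heights in inches, weights in pounds with Gaussian noise.
rng = np.random.default_rng(1)
height = rng.normal(66, 4, size=500)
weight = -350 + 7.7 * height + rng.normal(0, 12, size=500)
demo = pd.DataFrame({'Height': height, 'Weight': weight})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the column means.
dx = demo['Height'] - demo['Height'].mean()
dy = demo['Weight'] - demo['Weight'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default
r_pandas = demo['Height'].corr(demo['Weight'])
print(f'manual: {r_manual:.4f}, pandas: {r_pandas:.4f}')
```

The two values should agree up to floating-point error, since `Series.corr` defaults to the Pearson method.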


3 - Train a regression model considering **height** as the independent variable and **weight** as the dependent one

In [15]:
reg = LinearRegression()

In [16]:
#The set was already divided using height as independent variable and weight as the dependent one
reg.fit(X_train, y_train)

LinearRegression()

In [17]:
print(f'The intercept of this regression (theta0) is {reg.intercept_[0]:.4f}, and the coefficient (theta1, slope) is {reg.coef_[0][0]:.4f}.')

The intercept of this regression (theta0) is -351.4145, and the coefficient (theta1, slope) is 7.7281.
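
Read as an equation, the fit above says weight ≈ -351.4145 + 7.7281 × height, so each additional inch of height adds about 7.73 lb to the predicted weight. For a single feature, these two numbers also have a closed form, which the sketch below checks against `LinearRegression`. The data is a synthetic stand-in (an assumption for illustration), so the printed values will differ from the notebook's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the Kaggle dataset)
rng = np.random.default_rng(2)
height = rng.normal(66, 4, size=500)
weight = -350 + 7.7 * height + rng.normal(0, 12, size=500)

# Closed-form simple linear regression:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(height, weight, ddof=1)[0, 1] / np.var(height, ddof=1)
intercept = weight.mean() - slope * height.mean()

# Same fit through scikit-learn (y is 1-D here, so intercept_ is a scalar)
model = LinearRegression().fit(height.reshape(-1, 1), weight)
print(f'closed form: theta0={intercept:.4f}, theta1={slope:.4f}')
print(f'sklearn:     theta0={model.intercept_:.4f}, theta1={model.coef_[0]:.4f}')
```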


4 - Compute the model determination coefficient

In [18]:
R2 = reg.score(X_train,y_train)
print(f'The model determination coefficient (R²) is {R2:.4f}. The model explains {R2:.2%} of the variance in weight in the training set.')

The model determination coefficient (R²) is 0.8532. The model explains 85.32% of the variance in weight in the training set.
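
As a sanity check on this value: for a one-feature linear regression with an intercept, R² on the training data equals the squared Pearson correlation, and the training correlation of 0.9237 reported earlier does square to about 0.8532. The sketch below illustrates the identity and the definition R² = 1 - SS_res / SS_tot on synthetic stand-in data (an assumption for illustration, not the Kaggle file).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the Kaggle dataset)
rng = np.random.default_rng(3)
height = rng.normal(66, 4, size=500)
weight = -350 + 7.7 * height + rng.normal(0, 12, size=500)
X_demo = height.reshape(-1, 1)

model = LinearRegression().fit(X_demo, weight)
pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((weight - pred) ** 2)
ss_tot = np.sum((weight - weight.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = model.score(X_demo, weight)   # same quantity via scikit-learn
r = np.corrcoef(height, weight)[0, 1]      # Pearson correlation
print(f'manual R2: {r2_manual:.4f}, sklearn R2: {r2_sklearn:.4f}, r squared: {r ** 2:.4f}')
```

The identity with r² holds only when the score is computed on the same data the model was fitted on; on a held-out test set the two can differ.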


5 - Plot a scatterplot of the two variables containing the regression model (line)

In [28]:
x_line = X['Height']
y_line = reg.predict(X)
y_line

array([[219.28505205],
       [180.14118465],
       [221.31823403],
       ...,
       [142.16581979],
       [182.09129469],
       [127.29885199]])

In [None]:
#predicted and actual weights for the test set, next to the corresponding heights
weights_predicted = pd.DataFrame({'Height': X_test['Height'].to_numpy(),
                                  'Weight': reg.predict(X_test).ravel(),
             })
weights_test = pd.DataFrame({'Height': X_test['Height'].to_numpy(),
                             'Weight': y_test['Weight'].to_numpy(),
             })

In [None]:
#x and y values of the fitted line over the training heights
regression_line_x = X_train['Height'].to_numpy()
regression_line_y = reg.predict(X_train).ravel()

sns.scatterplot(x=X_train['Height'], y=y_train['Weight'])

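In [None]:
fig, axs = plt.subplots(1, 2, figsize=(20, 6), sharex=True, sharey=True)

#the fitted line, evaluated over the training heights
regression_line_x = X_train['Height'].to_numpy()
regression_line_y = reg.predict(X_train).ravel()

#left panel: training set; right panel: test set; both show the same fitted line
sns.scatterplot(x=X_train['Height'], y=y_train['Weight'], ax=axs[0])
sns.scatterplot(x=X_test['Height'], y=y_test['Weight'], ax=axs[1])
for ax in axs:
    sns.lineplot(x=regression_line_x, y=regression_line_y, color="red", ax=ax)
    ax.set_xlabel('Height (in)')
    ax.set_ylabel('Weight (lb)')
axs[0].set_title('Weight vs Height (Training Set)')
axs[1].set_title('Weight vs Height (Test Set)')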
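
The notebook stops before the last task, computing error metrics on the test set. A minimal sketch of that step follows, assuming the usual trio of MAE, MSE and RMSE is what is wanted. It runs on synthetic stand-in data with hypothetical names (`X_tr`, `X_te`, `model`), so the numbers are not results for the Kaggle dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Kaggle dataset):
# heights in inches, weights in pounds with Gaussian noise.
rng = np.random.default_rng(0)
height = rng.normal(66, 4, size=1000)
weight = -350 + 7.7 * height + rng.normal(0, 12, size=1000)

X_demo = height.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, weight, test_size=0.2, random_state=0
)

# Fit on the training split only, then predict the held-out split
model = LinearRegression().fit(X_tr, y_tr)
y_pred = model.predict(X_te)

mae = mean_absolute_error(y_te, y_pred)   # mean absolute error, in pounds
mse = mean_squared_error(y_te, y_pred)    # mean squared error, in pounds squared
rmse = np.sqrt(mse)                       # root mean squared error, back in pounds
print(f'MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}')
```

In the notebook itself, the same three calls would take `y_test` and `reg.predict(X_test)` in place of the stand-in arrays.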

Questions