<img src="./assets/img/teclab_logo.png" alt="Teclab logo" width="170">

**Author**: Hector Vergara ([LinkedIn](https://www.linkedin.com/in/hector-vergara/))

# API 1:

### Situation
Congratulations! We have recently been hired as junior data scientists at Trump and Co., the Latin American headquarters of Trump International, whose investment fund focused on real estate. Our new boss wants to implement a machine learning model based on linear regression to predict property prices.
To design the machine learning model, the chosen strategy is to build a dummy (test) model from a well-known international case, the Ames Housing Dataset.
The Ames dataset was introduced by Professor Dean De Cock in 2011 as an alternative to the Boston housing dataset from 1978.
## Tasks
The first step is to download the dataset, which may be in Excel or CSV format. Next, we need to identify the libraries required to load the data into a data frame. We will also need to set aside a test set equal to 20% of the total data. It is essential to determine the total number of observations with missing data in the dataset, as well as the number of null values. The first part of the preliminary study will consist of the correlation matrix, restricted to the variables with a correlation above 60% (equivalent to 0.60).

In [135]:
import os
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

__version__ = '0.0.1'
__email__ = 'hhvservice@gmail.com'
__author__ = 'Hector Vergara'
__url__ = 'https://www.linkedin.com/in/hector-vergara/'
__base_dir__ = Path().absolute()
__data_dir__ = os.path.join(__base_dir__, 'data')
filename_data = os.path.join(__data_dir__, 'AmesHousing.csv')
printing = lambda text: print("\033[92m" + text + "\033[0m")

In [136]:
# Load the data from the Kaggle dataset
# Dataset Source: https://www.kaggle.com/datasets/shashanknecrothapa/ames-housing-dataset
df = pd.read_csv(filename_data)
df.head(20)

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900
5,6,527105030,60,RL,78.0,9978,Pave,,IR1,Lvl,...,0,,,,0,6,2010,WD,Normal,195500
6,7,527127150,120,RL,41.0,4920,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,213500
7,8,527145080,120,RL,43.0,5005,Pave,,IR1,HLS,...,0,,,,0,1,2010,WD,Normal,191500
8,9,527146030,120,RL,39.0,5389,Pave,,IR1,Lvl,...,0,,,,0,3,2010,WD,Normal,236500
9,10,527162130,60,RL,60.0,7500,Pave,,Reg,Lvl,...,0,,,,0,6,2010,WD,Normal,189000
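
Since the dataset may arrive as either Excel or CSV, the loading step can be wrapped in a small helper that picks the reader from the file extension. The sketch below is an assumption, not part of the original notebook: `load_dataset` is a hypothetical name, and reading `.xlsx` files assumes an Excel engine such as `openpyxl` is installed.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Load a CSV or Excel file into a DataFrame (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xls", ".xlsx"}:
        # pandas delegates Excel parsing to an optional engine (e.g. openpyxl).
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file extension: {suffix!r}")
```

With this helper, the loading cell could read `df = load_dataset(filename_data)` and would keep working if the file were later swapped for an Excel export.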


In [137]:
# Check for missing data (null values)
missing_data = df.isnull().sum().sum()
int(missing_data)

15749
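
The cell above counts missing cells across the whole data frame. The task also asks for the number of observations (rows) affected, which is a different quantity. A minimal sketch on a made-up toy frame (the `toy` values are illustrative and are not results from the Ames data) shows both counts side by side, plus a per-column breakdown:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Ames data; the values are illustrative only.
toy = pd.DataFrame({
    "Lot Frontage": [141.0, np.nan, 81.0, 93.0],
    "Alley": [np.nan, np.nan, "Pave", np.nan],
    "SalePrice": [215000, 105000, 172000, 244000],
})

# Total number of missing cells (same expression as in the notebook).
missing_cells = int(toy.isnull().sum().sum())

# Number of observations (rows) with at least one missing value.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

# Missing values per column, keeping only affected columns, largest first.
missing_per_column = toy.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0].sort_values(ascending=False)

print(f"Missing cells: {missing_cells}")
print(f"Rows with missing data: {rows_with_missing}")
print(missing_per_column)
```

Running the same three expressions on `df` instead of `toy` would give the corresponding figures for the real dataset.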

In [138]:
# Split data into train and test dataframes
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Extracting the numeric columns from the dataframe in a subset variable:
numeric_subset = df.select_dtypes(include=['float64', 'int64'])
# Now, create the correlation matrix:
correlation_matrix = numeric_subset.corr()
correlation_matrix.head(10)

Unnamed: 0,Order,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,...,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice
Order,1.0,0.173593,0.011797,-0.007034,0.031354,-0.0485,-0.011054,-0.052319,-0.075566,-0.030907,...,-0.011292,0.016355,0.027908,-0.024975,0.004307,0.052518,-0.006083,0.133365,-0.975993,-0.031408
PID,0.173593,1.0,-0.001281,-0.096918,0.034868,-0.263147,0.104451,-0.343388,-0.157111,-0.229283,...,-0.051135,-0.071311,0.162519,-0.024894,-0.025735,-0.002845,-0.00826,-0.050455,0.009579,-0.246521
MS SubClass,0.011797,-0.001281,1.0,-0.420135,-0.204613,0.039419,-0.067349,0.036579,0.043397,0.00273,...,-0.01731,-0.014823,-0.022866,-0.037956,-0.050614,-0.003434,-0.029254,0.00035,-0.017905,-0.085092
Lot Frontage,-0.007034,-0.096918,-0.420135,1.0,0.491313,0.212042,-0.074448,0.121562,0.091712,0.222407,...,0.120084,0.16304,0.012758,0.028564,0.076666,0.173947,0.044476,0.011085,-0.007547,0.357318
Lot Area,0.031354,0.034868,-0.204613,0.491313,1.0,0.097188,-0.034759,0.023258,0.021682,0.12683,...,0.157212,0.10376,0.021868,0.016243,0.055044,0.093775,0.069188,0.003859,-0.023085,0.266549
Overall Qual,-0.0485,-0.263147,0.039419,0.212042,0.097188,1.0,-0.094812,0.597027,0.569609,0.429418,...,0.255663,0.298412,-0.140332,0.01824,0.041615,0.030399,0.005179,0.031103,-0.020719,0.799262
Overall Cond,-0.011054,0.104451,-0.067349,-0.074448,-0.034759,-0.094812,1.0,-0.368773,0.04768,-0.13534,...,0.020344,-0.068934,0.071459,0.043852,0.044055,-0.016787,0.034056,-0.007295,0.031207,-0.101697
Year Built,-0.052319,-0.343388,0.036579,0.121562,0.023258,0.597027,-0.368773,1.0,0.612095,0.313292,...,0.228964,0.198365,-0.374364,0.015803,-0.041436,0.002213,-0.011011,0.014577,-0.013197,0.558426
Year Remod/Add,-0.075566,-0.157111,0.043397,0.091712,0.021682,0.569609,0.04768,0.612095,1.0,0.196928,...,0.217857,0.241748,-0.220383,0.037412,-0.046888,-0.01141,-0.003132,0.018048,0.032652,0.532974
Mas Vnr Area,-0.030907,-0.229283,0.00273,0.222407,0.12683,0.429418,-0.13534,0.313292,0.196928,1.0,...,0.165467,0.143748,-0.110787,0.013778,0.065643,0.004617,0.044934,-0.000276,-0.017715,0.508285
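
`DataFrame.corr()` computes Pearson's correlation coefficient by default. As a sanity check on what each entry of the matrix means, the sketch below recomputes one coefficient by hand on a small toy frame and compares it with the pandas result. The values in `toy` are illustrative and do not reproduce the real Ames columns.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not the real Ames columns.
toy = pd.DataFrame({
    "Overall Qual": [6, 5, 6, 7, 5, 8],
    "SalePrice": [215000, 105000, 172000, 244000, 189900, 236500],
})

x = toy["Overall Qual"].to_numpy(dtype=float)
y = toy["SalePrice"].to_numpy(dtype=float)

# Pearson's r: sum of products of the centered values, divided by the
# square root of the product of the two sums of squares.
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# pandas uses method="pearson" unless told otherwise.
r_pandas = toy.corr().loc["Overall Qual", "SalePrice"]

print(round(r_manual, 6), round(r_pandas, 6))
```

The two printed numbers should agree up to floating-point rounding.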


In [139]:
# Selecting the correlations whose absolute value exceeds 60% (0.60):
correlation_threshold = 0.60
correlation_up_to_60 = correlation_matrix[abs(correlation_matrix) > correlation_threshold]
correlation_up_to_60.head(10)

Unnamed: 0,Order,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,...,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice
Order,1.0,,,,,,,,,,...,,,,,,,,,-0.975993,
PID,,1.0,,,,,,,,,...,,,,,,,,,,
MS SubClass,,,1.0,,,,,,,,...,,,,,,,,,,
Lot Frontage,,,,1.0,,,,,,,...,,,,,,,,,,
Lot Area,,,,,1.0,,,,,,...,,,,,,,,,,
Overall Qual,,,,,,1.0,,,,,...,,,,,,,,,,0.799262
Overall Cond,,,,,,,1.0,,,,...,,,,,,,,,,
Year Built,,,,,,,,1.0,0.612095,,...,,,,,,,,,,
Year Remod/Add,,,,,,,,0.612095,1.0,,...,,,,,,,,,,
Mas Vnr Area,,,,,,,,,,1.0,...,,,,,,,,,,


In [140]:
# Cleaning the correlation matrix by removing values equal to +/-1 (the self-correlations on the diagonal):
filtered_correlation = correlation_up_to_60.where(abs(correlation_up_to_60) < 1)
filtered_correlation.head(10)

Unnamed: 0,Order,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,...,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice
Order,,,,,,,,,,,...,,,,,,,,,-0.975993,
PID,,,,,,,,,,,...,,,,,,,,,,
MS SubClass,,,,,,,,,,,...,,,,,,,,,,
Lot Frontage,,,,,,,,,,,...,,,,,,,,,,
Lot Area,,,,,,,,,,,...,,,,,,,,,,
Overall Qual,,,,,,,,,,,...,,,,,,,,,,0.799262
Overall Cond,,,,,,,,,,,...,,,,,,,,,,
Year Built,,,,,,,,,0.612095,,...,,,,,,,,,,
Year Remod/Add,,,,,,,,0.612095,,,...,,,,,,,,,,
Mas Vnr Area,,,,,,,,,,,...,,,,,,,,,,
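
Filtering with `abs(...) < 1` removes the diagonal, but it would also discard any two distinct variables that happen to be perfectly correlated, and it leaves every pair in the matrix twice. One alternative, sketched here on invented toy data, masks everything except the strict upper triangle with `numpy.triu`, which handles both issues in one step:

```python
import numpy as np
import pandas as pd

# Invented toy data: b is exactly 2 * a, so the pair (a, b) is perfectly correlated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr = toy.corr()

# Boolean mask that is True strictly above the main diagonal (k=1).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep the upper triangle only, then flatten to (row, column) -> value pairs.
pairs = corr.where(upper).stack().dropna()

# Apply the same 0.60 threshold on the absolute value.
strong_pairs = pairs[pairs.abs() > 0.60]
print(strong_pairs)
```

Each pair appears once, the diagonal is gone, and the perfectly correlated pair `(a, b)` is kept.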


In [141]:
# Extracting the most strongly correlated variable pairs:
# "Unrolling" the matrix and sorting by correlation value, highest first
correlation_pairs = filtered_correlation.unstack().dropna().sort_values(ascending=False)

# Removing duplicate pairs (each pair appears twice in the symmetric matrix):
correlation_pairs = correlation_pairs[correlation_pairs.index.get_level_values(0) < correlation_pairs.index.get_level_values(1)]
print(f"""
Total number of correlation pairs above 60%: {len(correlation_pairs)}
Top 5 correlation pairs:

{correlation_pairs.head(5)}
""".replace('dtype: float64', ''))


Total number of correlation pairs above 60%: 18
Top 5 correlation pairs:

Garage Area    Garage Cars      0.889676
Garage Yr Blt  Year Built       0.834849
Gr Liv Area    TotRms AbvGrd    0.807772
1st Flr SF     Total Bsmt SF    0.800720
Overall Qual   SalePrice        0.799262
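
The cell above sorts by signed value, so a strong negative correlation such as `Order` / `Yr Sold` (-0.975993 in the matrix shown earlier) falls to the end of the ranking and never reaches the top 5. If the goal is to rank by strength regardless of sign, one option is the `key` argument of `Series.sort_values` (available from pandas 1.1 onward). The sketch below uses four pairs copied from the outputs above; the variable names `by_value` and `by_strength` are hypothetical.

```python
import pandas as pd

# Four pairs copied from the notebook outputs above.
pairs = pd.Series({
    ("Garage Area", "Garage Cars"): 0.889676,
    ("Overall Qual", "SalePrice"): 0.799262,
    ("Order", "Yr Sold"): -0.975993,
    ("Year Built", "Year Remod/Add"): 0.612095,
})

# Signed sort, as in the notebook: the negative pair ends up last.
by_value = pairs.sort_values(ascending=False)

# Sort by absolute value: the strongest relationship comes first,
# and the original sign is preserved in the result.
by_strength = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(by_strength)
```

Here `by_strength` puts `Order` / `Yr Sold` first, whereas `by_value` puts it last.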


