# Machine Learning Test

You have 3 days to solve the test from the moment you receive it. 

Show us your skills!

## Problem description

You have been hired as a Data Scientist at a top real estate company in California, and your first job is to develop an ML model to predict house prices. The company will then use this model as an investment tool, to buy houses priced below their real value and to negotiate down overpriced ones.

## Get the data

Over the years, your company has gathered some data that you can start using. It is available at the following link: https://mymldatasets.s3.eu-de.cloud-object-storage.appdomain.cloud/housing.tgz

In [1]:
# download the data
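
The download step could be sketched as follows. The function name `fetch_housing_data` is hypothetical, the default destination mirrors the path read later in this notebook, and the sketch assumes the archive unpacks to a `housing.csv` file at its top level, which the source does not confirm.

```python
import tarfile
import urllib.request
from pathlib import Path

HOUSING_URL = (
    "https://mymldatasets.s3.eu-de.cloud-object-storage.appdomain.cloud/housing.tgz"
)


def fetch_housing_data(url=HOUSING_URL, dest_dir="e-sports/data"):
    """Download the archive at `url` and extract it into `dest_dir`.

    Hypothetical helper: the archive layout is an assumption.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / "housing.tgz"

    # Save the archive locally, then unpack it next to itself.
    urllib.request.urlretrieve(url, str(archive_path))

    # Only extract archives that come from a source you trust.
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=dest)
    return dest
```

Calling `fetch_housing_data()` once before the loading cell should leave the extracted files in `e-sports/data`, if the assumption about the archive layout holds.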

In [2]:
## Imports

In [3]:
# Data Analysis
import pandas as pd
import numpy as np
import math

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import textwrap

# Data Visualization for text
from PIL import Image
from os import path
import os
import random
from wordcloud import WordCloud, STOPWORDS

# Text Processing
import re
import itertools
import spacy
import string
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS as STOP_WORDS_EN
import en_core_web_sm

from spacy.lang.es import Spanish
from spacy.lang.es.stop_words import STOP_WORDS as STOP_WORDS_ES
import es_core_news_sm

from spacy.lang.xx import MultiLanguage
STOP_WORDS = STOP_WORDS_EN | STOP_WORDS_ES
import xx_ent_wiki_sm

from collections import Counter

# Ignore noise warning
import warnings
warnings.filterwarnings('ignore')

# Work with pickles
import pickle

# Show every column when displaying a DataFrame
pd.set_option('display.max_columns', None)

## Explore the data

Load the data to explore it. Try to answer the following questions:

- How many features does the dataset have? What types are they?
- How many samples are in the dataset?
- Are there missing values?
- Is there any correlation between features?

In [4]:
housing_df = pd.read_csv('e-sports/data/housing.csv')
housing_df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [5]:
housing_df.shape

(20640, 10)

In [6]:
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [7]:
housing_df.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [8]:
housing_df.duplicated().sum()

0

In [9]:
housing_df.nunique()

longitude               844
latitude                862
housing_median_age       52
total_rooms            5926
total_bedrooms         1923
population             3888
households             1815
median_income         12928
median_house_value     3842
ocean_proximity           5
dtype: int64

In [10]:
housing_df.describe().to_csv('e-sports/data/output_csv/descriptive_stats.csv')

In [11]:
housing_df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |
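
The cells above cover feature types, sample count, and missing values, but not the correlation question. A minimal sketch of one way to approach it follows. The helper name `correlation_with_target` is hypothetical, and it uses Pearson correlation (the pandas default), which only captures linear relationships.

```python
import pandas as pd


def correlation_with_target(df, target="median_house_value"):
    """Rank numeric features by their Pearson correlation with `target`."""
    # Non-numeric columns such as ocean_proximity are excluded.
    numeric = df.select_dtypes(include="number")
    correlations = numeric.corr()[target].drop(target)
    return correlations.sort_values(ascending=False)
```

For example, `correlation_with_target(housing_df)` would return one value per numeric column, strongest positive correlation first.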


## Visualize the data

Try to come up with the best possible visualizations (most informative) for the dataset.

In [0]:
# visualize the data
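
Since the dataset includes coordinates, one candidate visualization is a geographic scatter plot coloured by price. The sketch below is only one option among many. The helper name `plot_geographic_prices` is hypothetical, and scaling the marker size by `population / 100` is an arbitrary choice made for readability.

```python
import matplotlib.pyplot as plt


def plot_geographic_prices(df):
    """Scatter districts by location, coloured by median house value."""
    fig, ax = plt.subplots(figsize=(10, 7))
    points = ax.scatter(
        df["longitude"],
        df["latitude"],
        s=df["population"] / 100,    # marker size tracks district population
        c=df["median_house_value"],  # colour tracks the prediction target
        cmap="viridis",
        alpha=0.4,                   # transparency reveals dense areas
    )
    fig.colorbar(points, ax=ax, label="median_house_value")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    return ax
```

`housing_df.hist(bins=50, figsize=(20, 15))` is a complementary quick look at the distribution of each numeric column.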

## Feature engineering

Try to combine several features into new ones and see if they have better correlation.

In [0]:
# feature engineering
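
Several columns are district-level totals, so ratios between them are natural candidates for combined features. The helper `add_ratio_features` and the three feature names below are illustrative assumptions, not part of the original test.

```python
import pandas as pd


def add_ratio_features(df):
    """Return a copy of `df` with per-household and per-room ratios added."""
    out = df.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    # Inherits NaN wherever total_bedrooms is missing.
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    return out
```

Comparing the correlation of these ratios with `median_house_value` against that of the raw totals would show whether they carry more signal.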

## Prepare the data for ML algorithms

Using [Scikit-learn](https://scikit-learn.org/stable/), build a pipeline for data preparation. The pipeline should include:

- Data cleaning
- Encoding of categorical features
- Feature scaling

In [0]:
# data preparation
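
The three steps listed above could be combined with a `ColumnTransformer`, as sketched below. The helper name `build_preprocessor`, median imputation, and standardization are assumptions; other imputation or scaling strategies may suit the data better.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Impute and scale numeric columns, one-hot encode categorical ones."""
    numeric_pipeline = Pipeline([
        # Data cleaning: fill missing values (e.g. total_bedrooms).
        ("imputer", SimpleImputer(strategy="median")),
        # Feature scaling: zero mean, unit variance.
        ("scaler", StandardScaler()),
    ])
    return ColumnTransformer(
        [
            ("num", numeric_pipeline, numeric_cols),
            # Categories unseen during fit are encoded as all zeros.
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ],
        sparse_threshold=0.0,  # always return a dense array
    )
```

A possible call is `build_preprocessor(numeric_cols, ["ocean_proximity"])`, where `numeric_cols` lists the numeric feature columns without the target.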

## Train some models

Train a list of models with default parameters to get a quick idea of how each one performs on the dataset. Compare their performance and keep the top 3-5 models for the next step.

In [0]:
# train a lot of models
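
A quick comparison could rely on cross-validated RMSE, as sketched below. The three candidate models and the helper name `compare_models` are illustrative choices; the list would normally be longer.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Example candidates with default hyperparameters.
candidate_models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(random_state=42),
}


def compare_models(models, X, y, cv=5):
    """Return mean cross-validated RMSE per model, best (lowest) first."""
    scores = {}
    for name, model in models.items():
        neg_rmse = cross_val_score(
            model, X, y, scoring="neg_root_mean_squared_error", cv=cv
        )
        # scikit-learn maximizes scores, so RMSE is returned negated.
        scores[name] = -neg_rmse.mean()
    return dict(sorted(scores.items(), key=lambda item: item[1]))
```

Here `X` is assumed to be the already prepared feature matrix and `y` the `median_house_value` column.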

## Fine-tune best models

Use Scikit-learn random search hyperparameter tuning to find the best hyperparameters for the best models selected in the previous step.

In [0]:
# fine tune best models
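
As an illustration, a random search over a random forest might look like the sketch below. The helper name `tune_random_forest` and the search ranges are assumptions picked for the example, not recommended values.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


def tune_random_forest(X, y, n_iter=20, cv=5, random_state=42):
    """Run a random hyperparameter search and return the fitted search."""
    # Each candidate draws one value per parameter from these ranges
    # (upper bounds are exclusive).
    param_distributions = {
        "n_estimators": randint(50, 300),
        "max_depth": randint(3, 20),
        "min_samples_leaf": randint(1, 10),
    }
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=random_state),
        param_distributions=param_distributions,
        n_iter=n_iter,
        scoring="neg_root_mean_squared_error",
        cv=cv,
        random_state=random_state,
    )
    search.fit(X, y)
    return search
```

After fitting, `search.best_params_` holds the selected values and `search.best_estimator_` is refit on all the data passed in.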

## Build an ensemble

Build an ensemble with the best fine-tuned models and evaluate its performance.

In [0]:
# build an ensemble
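
One simple form of ensemble averages the predictions of the tuned models with `VotingRegressor`; stacking would be an alternative. Both helper names below are hypothetical, and the held-out split is assumed to be prepared beforehand.

```python
from sklearn.ensemble import VotingRegressor
from sklearn.metrics import mean_squared_error


def build_voting_ensemble(named_models):
    """Combine a {name: estimator} dict into an unweighted averaging ensemble."""
    return VotingRegressor(estimators=list(named_models.items()))


def evaluate_rmse(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training split and return its RMSE on the test split."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_squared_error(y_test, predictions) ** 0.5
```

Comparing the ensemble's RMSE with that of each member on the same split indicates whether averaging helps.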

## Optional

For extra points try to:

- Implement custom transformers for feature selection and feature engineering.
- Include the data preparation and custom transformers in the fine-tuning pipeline to come up with the best set of features and data preparation strategies.
- Perform error analysis to select the models that make the most diverse types of errors for the ensemble.
- Do anything that teaches us something new!

In [0]:
# show your skills!
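
For the error-analysis item, one possible approach is to correlate the out-of-fold residuals of the candidate models: pairs with low correlation make different errors and may complement each other in an ensemble. The helper name `residual_correlation` is hypothetical, and residual correlation is only one of several ways to measure error diversity.

```python
import pandas as pd
from sklearn.model_selection import cross_val_predict


def residual_correlation(models, X, y, cv=5):
    """Correlate the out-of-fold residuals of each pair of models."""
    residuals = {}
    for name, model in models.items():
        # Each prediction comes from a model that never saw that sample.
        predictions = cross_val_predict(model, X, y, cv=cv)
        residuals[name] = y - predictions
    return pd.DataFrame(residuals).corr()
```

The result is a symmetric matrix with one row and one column per model name.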