# Challenge 6

In this challenge, we will practice _feature engineering_, one of the most important and labor-intensive processes in ML. We will use the [Countries of the world](https://www.kaggle.com/fernandol/countries-of-the-world) _data set_, which contains data on the 227 countries of the world, with information on population size, area, migration, and production sectors.

> Note: Please do not change the names of the answer functions.

## General _setup_

In [1]:
import math

import sklearn 
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer, OneHotEncoder

In [2]:
# Read dataset
countries = pd.read_csv("countries.csv")

In [3]:
new_column_names = [
    "Country", "Region", "Population", "Area", "Pop_density", "Coastline_ratio",
    "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
    "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
    "Industry", "Service"
]

countries.columns = new_column_names

countries.head(5)

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


## Observations

This _data set_ still needs a few initial adjustments. First, note that the numeric variables use a comma as the decimal separator and are encoded as strings. Fix this before continuing: convert these variables to proper numeric types.

In addition, the `Country` and `Region` variables have extra whitespace at the beginning and end of the string. You can use the `str.strip()` method to remove it.
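
A minimal sketch of both fixes on a tiny inline CSV. The sample rows and the `raw_csv` and `sample` names are made up for illustration and are not taken from `countries.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the raw file:
# padded strings and decimal commas inside quoted fields
raw_csv = io.StringIO(
    'Country,Region,Pop_density\n'
    '"Afghanistan ","ASIA (EX. NEAR EAST)         ","48,0"\n'
    '"Albania ","EASTERN EUROPE   ","124,6"\n'
)

# decimal=',' makes pandas parse "48,0" as the float 48.0
sample = pd.read_csv(raw_csv, decimal=',')

# str.strip() removes whitespace on both sides of each string
for col in ['Country', 'Region']:
    sample[col] = sample[col].str.strip()
```

Passing `decimal=','` at read time avoids a separate string-replace pass over every numeric column.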

In [4]:
# Reading again and replacing decimal separator
countries = pd.read_csv('./countries.csv', decimal=',')
countries.head(5)

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1.0,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3.0,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1.0,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2.0,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3.0,8.71,6.25,,,


In [5]:
# Removing leading and trailing whitespace from object columns, i.e. Country and Region
countries = countries.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)

# Checking
countries['Country'][0]

'Afghanistan'

In [6]:
info = pd.DataFrame({'dtype': countries.dtypes,
                    'unique_vals': countries.nunique(),
                    'missing%': (countries.isna().sum() / countries.shape[0]) * 100
                    })
info.T

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
dtype,object,object,int64,int64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64
unique_vals,227,11,227,226,219,151,157,220,130,140,214,203,162,209,6,220,201,150,155,167
missing%,0,0,0,0,0,0,1.32159,1.32159,0.440529,7.92952,1.76211,0.881057,0.881057,0.881057,9.69163,1.32159,1.76211,6.60793,7.04846,6.60793


## Handle missing data
Countries in the same Region are likely to share similar values, so I chose to fill missing data with the Region mean.

However, this approach still causes problems for some features, for example **Climate**.
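
To illustrate the Climate problem, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical). The Region mean of a category code can produce a value that is not an existing category, while the Region mode always returns an existing code. This only shows the mechanics and is not a claim about which strategy is right for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Climate is a category code, not a quantity
toy = pd.DataFrame({
    'Region': ['A', 'A', 'A', 'B', 'B'],
    'Climate': [1.0, 2.0, np.nan, 3.0, 3.0],
})

# Fill with the Region mean: creates 1.5, which is not an existing climate code
region_mean = toy.groupby('Region')['Climate'].transform('mean')
mean_fill = toy['Climate'].fillna(region_mean)

# Fill with the Region mode instead: the result is always an existing code
# (on ties, mode() returns the values sorted and we take the first)
region_mode = toy.groupby('Region')['Climate'].transform(lambda s: s.mode().iloc[0])
mode_fill = toy['Climate'].fillna(region_mode)

print(mean_fill.nunique(), mode_fill.nunique())
```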

In [7]:
countries_fill = countries.copy()

# Get the names of the numeric columns
numeric_cols = countries_fill.select_dtypes(include='number').columns.tolist()

# Fill NaN in each numeric column with the mean of its Region
for col in numeric_cols:
    region_mean = countries_fill.groupby("Region")[col].transform("mean")
    countries_fill[col] = countries_fill[col].fillna(region_mean)

countries_fill.head(5)

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1.0,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3.0,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1.0,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2.0,22.46,3.27,0.175125,0.21525,0.608937
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3.0,8.71,6.25,0.04448,0.246083,0.714625


## Start your analysis from here

In [9]:
# Your analysis starts here.

## Question 1

Which regions (variable `Region`) are present in the _data set_? Return a list of the unique regions in the _data set_, with leading and trailing whitespace removed (but keep punctuation: period, hyphen, etc.) and sorted alphabetically.

In [10]:
def q1():
    # Return the answer to question 1 here.
    return sorted(countries_fill['Region'].unique())
q1()

['ASIA (EX. NEAR EAST)',
 'BALTICS',
 'C.W. OF IND. STATES',
 'EASTERN EUROPE',
 'LATIN AMER. & CARIB',
 'NEAR EAST',
 'NORTHERN AFRICA',
 'NORTHERN AMERICA',
 'OCEANIA',
 'SUB-SAHARAN AFRICA',
 'WESTERN EUROPE']

## Question 2

Discretizing the `Pop_density` variable into 10 bins with `KBinsDiscretizer`, using `ordinal` encoding and the `quantile` strategy, how many countries lie above the 90th percentile? Answer with a single integer scalar.

### Useful resources
*  https://pbpython.com/pandas-qcut-cut.html

*  https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
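
As a rough illustration of how quantile binning relates to the 90th percentile, the sketch below runs `KBinsDiscretizer` on made-up data (the integers 1 to 100, an assumption for illustration only). With 10 quantile bins there are 11 bin edges, and the edge at index 9 is the 90th percentile.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical data: the integers 1..100 as a single feature column
values = np.arange(1, 101, dtype=float).reshape(-1, 1)

discretizer = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
bins = discretizer.fit_transform(values)

# bin_edges_[0] holds 11 edges for 10 bins; index 9 is the 90th percentile
p90 = discretizer.bin_edges_[0][9]

# Rows assigned to the last bin (code 9) lie at or above that edge
n_top_bin = int((bins[:, 0] == 9).sum())
print(p90, n_top_bin)
```

Counting rows with ordinal code 9 and counting values above `bin_edges_[0][9]` agree unless a value is exactly equal to that edge.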

In [11]:
def q2():
    # Return the answer to question 2 here.
    # Transform Pop Density to numpy array and reshape it
    # Each value will be an array
    pop_density = countries_fill['Pop. Density (per sq. mi.)'].to_numpy()
    pop_density = pop_density.reshape((-1,1))

    # Calling sklearn KBins and fitting
    kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
    kbins_popd = kbins.fit(pop_density.tolist())

    # Get the 90th percentile
    percentil_90 = kbins_popd.bin_edges_[0][9]

    # Slice dataset to countries > p90 
    countries_above_p90 = countries_fill[countries_fill['Pop. Density (per sq. mi.)'] > percentil_90 ]
    return int(countries_above_p90['Country'].count())
q2()

23

## Question 3

If we encode the `Region` and `Climate` variables using _one-hot encoding_, how many new attributes would be created? Answer with a single scalar.

Useful resource: https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

### Solving by using One Hot Encoding (OHE)
This question exposes a problem with the missing-data handling. Earlier, I filled missing values with the per-Region mean from `groupby()`. That created new values for the Climate variable, which changed its count of unique values (as the code below shows). One way around this would be to fill missing values with the mode instead, but that causes other problems.

In [12]:
climate = countries_fill[['Climate']].to_numpy().reshape((-1,1))
region = countries_fill[['Region']].to_numpy().reshape((-1,1))

# Create a OneHotEncoder and fit it on each column
encoder = OneHotEncoder(categories='auto')
climate_OHE = encoder.fit_transform(climate).toarray()
region_OHE = encoder.fit_transform(region).toarray()

# The columns number will be the new features created
new_cols = climate_OHE.shape[1] + region_OHE.shape[1]
new_cols

23

### Solving by dataframe shape
Since the question asks how many columns one-hot encoding would create, a more straightforward solution is to add the number of unique values in the two columns (Climate and Region). Note that the accepted answer counts NaN as a possible category, which is why `q3` uses `len(countries['Climate'].unique())` rather than `nunique()`.
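
The difference between counting with and without NaN can be sketched on a made-up frame (the `toy` data is hypothetical): `nunique()` ignores NaN, `unique()` keeps it, and `pd.get_dummies(..., dummy_na=True)` creates an indicator column for it.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: three regions, two climate codes plus a missing value
toy = pd.DataFrame({
    'Region': ['A', 'B', 'C', 'A'],
    'Climate': [1.0, 2.0, np.nan, 1.0],
})

# nunique() ignores NaN, while unique() keeps it as one more value
n_without_nan = toy['Region'].nunique() + toy['Climate'].nunique()
n_with_nan = toy['Region'].nunique() + len(toy['Climate'].unique())

# get_dummies with dummy_na=True adds an indicator column for NaN as well
climate_dummies = pd.get_dummies(toy['Climate'], dummy_na=True)
print(n_without_nan, n_with_nan, climate_dummies.shape[1])
```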

In [13]:
def q3():
    # Return the answer to question 3 here.
    return countries['Region'].nunique() + len(countries['Climate'].unique())
q3()

18

## Question 4

Apply the following _pipeline_:

1. Fill the variables of type `int64` and `float64` with their respective medians.
2. Standardize these variables.

After applying the _pipeline_ described above to the data (only to the variables of the specified types), apply the same _pipeline_ (or `ColumnTransformer`) to the data below. What is the value of the `Arable` variable after the _pipeline_? Answer with a single float rounded to three decimal places.

In [14]:
# Catch all numeric columns
numeric_cols

# Pipeline for numerical data 
# 1. fillna with median 
# 2. Standardization
preprocess = Pipeline(steps=[
                        ('imput', SimpleImputer(missing_values=np.nan, strategy='median')),
                        ('standard', StandardScaler())
                        ])

preprocessing_country = preprocess.fit(countries[numeric_cols])

### Keep in mind...
*  **Why use a pipeline?** Because you can chain all the processing steps into a single object and apply it to multiple columns.

    Example: https://jorisvandenbossche.github.io/blog/2018/05/28/scikit-learn-columntransformer/


*   **Why use `ColumnTransformer`?** You may need to handle both numeric and categorical data. To do so, you can create a pipeline for each data type (i.e. numerical and categorical) and apply both transformations at the same time.

    Example: https://stackoverflow.com/questions/54646709/sklearn-pipeline-get-feature-name-after-onehotencode-in-columntransformer/54648023
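
The two points above can be combined in a short sketch. The `toy` frame, the step names, and the choice of `OneHotEncoder` for the categorical branch are assumptions for illustration, not part of the challenge solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame({
    'GDP': [700.0, 4500.0, np.nan, 8000.0],
    'Region': ['ASIA', 'EUROPE', 'AFRICA', 'ASIA'],
})

# Numeric branch: median imputation followed by standardization
numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Route each column group to its own transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_pipeline, ['GDP']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Region']),
])

# One scaled numeric column plus one indicator column per region
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

Once fitted, the same `preprocessor` can be reused with `transform` on new rows, so the medians, scaling parameters, and categories learned from the original data are applied consistently.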

In [15]:
test_country = [
    'Test Country', 'NEAR EAST', -0.19032480757326514,
    -0.3232636124824411, -0.04421734470810142, -0.27528113360605316,
    0.13255850810281325, -0.8054845935643491, 1.0119784924248225,
    0.6189182532646624, 1.0074863283776458, 0.20239896852403538,
    -0.043678728558593366, -0.13929748680369286, 1.3163604645710438,
    -0.3699637766938669, -0.6149300604558857, -0.854369594993175,
    0.263445277972641, 0.5712416961268142
]

In [16]:
# Creating a dataframe from the test country and reshaping
test_country = pd.DataFrame(test_country).T
test_country.columns = countries.columns
test_country

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Test Country,NEAR EAST,-0.190325,-0.323264,-0.0442173,-0.275281,0.132559,-0.805485,1.01198,0.618918,1.00749,0.202399,-0.0436787,-0.139297,1.31636,-0.369964,-0.61493,-0.85437,0.263445,0.571242


In [17]:
def q4():
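    # Return the answer to question 4 here.
    # Apply the fitted pipeline to test_country
    test_processed = preprocessing_country.transform(test_country[numeric_cols])
    # Locate Arable by name (index 9 among the numeric columns)
    arable_idx = numeric_cols.index('Arable (%)')
    return float(round(test_processed[0, arable_idx], 3))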

q4()

-1.047

## Question 5

Find the number of _outliers_ of the `Net_migration` variable in the lower group and in the upper group according to the _boxplot_ method, that is, using the rule:

$$x \notin [Q1 - 1.5 \times \text{IQR}, Q3 + 1.5 \times \text{IQR}] \Rightarrow x \text{ is an outlier}$$

Should you remove from the analysis the observations considered _outliers_ by this method? Answer with a tuple of three elements `(outliers_below, outliers_above, remove?)` ((int, int, bool)).
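
As a small worked example of the boxplot rule, the sketch below applies it to a made-up sample (the array `x` is hypothetical) with one extreme value on each side.

```python
import numpy as np

# Hypothetical sample with one low and one high extreme value
x = np.array([-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# Quartiles and interquartile range
q1_value, q3_value = np.percentile(x, [25, 75])
iqr = q3_value - q1_value

# Boxplot fences: anything outside [lower, upper] is flagged as an outlier
lower, upper = q1_value - 1.5 * iqr, q3_value + 1.5 * iqr

outliers_below = int((x < lower).sum())
outliers_above = int((x > upper).sum())
print(lower, upper, outliers_below, outliers_above)
```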

In [18]:
country_q1, country_q3 = countries.quantile(q=0.25, numeric_only=True), countries.quantile(q=0.75, numeric_only=True)

In [19]:
def q5():
    # Return the answer to question 5 here.
    net_migration = countries['Net migration'].dropna()

    # Boxplot fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
    quartile_1, quartile_3 = net_migration.quantile(0.25), net_migration.quantile(0.75)
    iqr = quartile_3 - quartile_1
    lower, higher = quartile_1 - 1.5 * iqr, quartile_3 + 1.5 * iqr

    # Count observations below and above the fences
    outliers_below = int((net_migration < lower).sum())
    outliers_above = int((net_migration > higher).sum())

    return (outliers_below, outliers_above, False)
q5()

(24, 26, False)

## Question 6
For questions 6 and 7, use the `fetch_20newsgroups` function from `sklearn`'s test datasets.

Consider loading the following categories and the `newsgroups` dataset:

```
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
```


Apply `CountVectorizer` to the `newsgroups` _data set_ and find the number of times the word _phone_ appears in the corpus. Answer with a single scalar.
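
A minimal sketch of the counting idea on a made-up three-document corpus (the `docs` list is hypothetical): `CountVectorizer` builds a sparse document-term matrix, and summing one column gives the total count of a word over all documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus
docs = [
    'my phone is ringing',
    'call me on the phone, the phone is new',
    'no calls today',
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each token to its column index
phone_idx = vectorizer.vocabulary_['phone']

# Summing the column gives the total occurrences over all documents
phone_total = int(counts[:, phone_idx].sum())
print(phone_total)
```

Looking up the column through `vocabulary_` keeps the matrix sparse, which matters on a corpus the size of `newsgroups`.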

In [20]:
# dataset
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)

In [21]:
def q6():
    # Return the answer to question 6 here.
    # Word count with sklearn
    vectorizer = CountVectorizer()
    newsgroup_vector = vectorizer.fit_transform(newsgroup['data'])

    # Column index of the word "phone" in the vocabulary
    phone_idx = vectorizer.vocabulary_['phone']

    # Sum the column to get the total count over the corpus
    return int(newsgroup_vector[:, phone_idx].sum())
q6()

213

## Question 7

Apply `TfidfVectorizer` to the `newsgroups` _data set_ and find the TF-IDF of the word _phone_. Answer with a single scalar rounded to three decimal places.

### What is TF-IDF?
> "Term Frequency - Inverse Document Frequency (...), this is performed by looking at how many times a word appears into a document while also paying attention to how many times the same word appears in other documents in the corpus."

> Source: https://programmerbackpack.com/tf-idf-explained-and-python-implementation/
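
To make the definition concrete, here is a small sketch on a made-up corpus (the `docs` list is hypothetical). It assumes scikit-learn's default settings, `smooth_idf=True` and `norm='l2'`, under which the IDF weight is `ln((1 + n_docs) / (1 + df)) + 1`, and compares a manual computation with the fitted `idf_` attribute.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "phone" appears in two documents out of three
docs = ['phone phone call', 'phone bill', 'bike ride']

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
tfidf = vectorizer.fit_transform(docs)

phone_idx = vectorizer.vocabulary_['phone']

# With smooth_idf=True, idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
n_docs, df_phone = 3, 2
idf_phone_manual = np.log((1 + n_docs) / (1 + df_phone)) + 1

# The fitted vectorizer exposes the learned weights in idf_
idf_phone = vectorizer.idf_[phone_idx]

# Sum of the L2-normalized TF-IDF scores of "phone" over the corpus
phone_score = float(tfidf[:, phone_idx].sum())
print(idf_phone, idf_phone_manual, phone_score)
```

Because each document row is L2-normalized, the summed score depends on the other words in each document and is not simply the count multiplied by the IDF weight.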

In [22]:
def q7():
    # Return the answer to question 7 here.
    tf_vector = TfidfVectorizer(use_idf=True)
    newsgroup_tfvector = tf_vector.fit_transform(newsgroup['data'])

    # Column index of the word "phone" in the vocabulary
    phone_idx = tf_vector.vocabulary_['phone']

    # Sum the TF-IDF scores of "phone" over all documents
    return float(round(newsgroup_tfvector[:, phone_idx].sum(), 3))
q7()

8.888