# Challenge 6

In this challenge, we will practice _feature engineering_, one of the most important and labor-intensive steps in ML. We will use the [Countries of the world](https://www.kaggle.com/fernandol/countries-of-the-world) _data set_, which describes 227 countries with information on population size, area, immigration, and production sectors.

> Note: Please do not rename the answer functions.

## General _setup_

In [221]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import sklearn as sk
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [3]:
# Some matplotlib settings.
#%matplotlib inline

#from IPython.core.pylabtools import figsize


#figsize(12, 8)

#sns.set()

In [4]:
countries = pd.read_csv("countries.csv")

In [5]:
new_column_names = [
    "Country", "Region", "Population", "Area", "Pop_density", "Coastline_ratio",
    "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
    "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
    "Industry", "Service"
]

countries.columns = new_column_names

countries.head(5)

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


## Notes

This _data set_ still needs some initial cleanup. First, note that the numeric variables use a comma as the decimal separator and are encoded as strings. Fix this before continuing: convert these variables to proper numeric types.

In addition, the `Country` and `Region` variables have extra whitespace at the beginning and end of the string. You can use the `str.strip()` method to remove it.
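
As a small illustration of both fixes, the sketch below parses comma decimals at load time and strips padded strings. The two-row sample is hypothetical and only mimics the shape of the raw file; `decimal=","` is a standard `pd.read_csv` parameter, but whether it suits every column of the real file is an assumption.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the raw file: padded strings and comma decimals.
raw = io.StringIO(
    'Country,Region,Pop_density\n'
    '"Afghanistan ","ASIA (EX. NEAR EAST)         ","48,0"\n'
    '"Albania ","EASTERN EUROPE     ","124,6"\n'
)

# decimal="," tells the parser to read "48,0" as the float 48.0.
sample = pd.read_csv(raw, decimal=",")

# Remove leading and trailing whitespace from the text columns.
for column in ["Country", "Region"]:
    sample[column] = sample[column].str.strip()

print(sample)
```

Converting at load time avoids repeating a `str.replace` call per column, at the cost of applying the same decimal convention to the whole file.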

## Start your analysis here

In [7]:
# Your analysis starts here.
pd.DataFrame({'dtypes': countries.dtypes,
             'missing values': countries.isna().sum()})

| | dtypes | missing values |
|---|---|---|
| Country | object | 0 |
| Region | object | 0 |
| Population | int64 | 0 |
| Area | int64 | 0 |
| Pop_density | object | 0 |
| Coastline_ratio | object | 0 |
| Net_migration | object | 3 |
| Infant_mortality | object | 3 |
| GDP | float64 | 1 |
| Literacy | object | 18 |


In [8]:
countries.describe()

| | Population | Area | GDP |
|---|---|---|---|
| count | 227.0 | 227.0 | 226.0 |
| mean | 28740280.0 | 598227.0 | 9689.823009 |
| std | 117891300.0 | 1790282.0 | 10049.138513 |
| min | 7026.0 | 2.0 | 500.0 |
| 25% | 437624.0 | 4647.5 | 1900.0 |
| 50% | 4786994.0 | 86600.0 | 5550.0 |
| 75% | 17497770.0 | 441811.0 | 15700.0 |
| max | 1313974000.0 | 17075200.0 | 55100.0 |


In [25]:
# Series.str.strip() returns each string with leading and trailing whitespace removed.
# Both text columns are padded in the raw file, so strip both.
countries['Country'] = countries['Country'].str.strip()
countries['Region'] = countries['Region'].str.strip()
countries['Region']

0      ASIA (EX. NEAR EAST)
1            EASTERN EUROPE
2           NORTHERN AFRICA
3                   OCEANIA
4            WESTERN EUROPE
               ...         
222               NEAR EAST
223         NORTHERN AFRICA
224               NEAR EAST
225      SUB-SAHARAN AFRICA
226      SUB-SAHARAN AFRICA
Name: Region, Length: 227, dtype: object

In [93]:
# Columns stored as strings that use a comma as the decimal separator.
comma_decimal_columns = [
    'Pop_density', 'Coastline_ratio', 'Net_migration', 'Infant_mortality',
    'Literacy', 'Phones_per_1000', 'Arable', 'Crops', 'Other',
    'Birthrate', 'Deathrate', 'Agriculture', 'Industry', 'Service',
]

# Swap the decimal comma for a point, then cast to float.
for column in comma_decimal_columns:
    countries[column] = countries[column].str.replace(',', '.').astype(float)

In [94]:
countries.head()

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3,8.71,6.25,,,


## Question 1

Which regions (variable `Region`) are present in the _data set_? Return a list of the unique regions, with leading and trailing whitespace removed (but keep punctuation: periods, hyphens, etc.), sorted alphabetically.

In [27]:
def q1():
    return list(sorted(countries['Region'].unique()))

q1()

['ASIA (EX. NEAR EAST)',
 'BALTICS',
 'C.W. OF IND. STATES',
 'EASTERN EUROPE',
 'LATIN AMER. & CARIB',
 'NEAR EAST',
 'NORTHERN AFRICA',
 'NORTHERN AMERICA',
 'OCEANIA',
 'SUB-SAHARAN AFRICA',
 'WESTERN EUROPE']

## Question 2

After discretizing the `Pop_density` variable into 10 bins with `KBinsDiscretizer`, using the `ordinal` encoding and the `quantile` strategy, how many countries fall above the 90th percentile? Answer with a single integer scalar.

In [111]:
# Signature reference: KBinsDiscretizer(n_bins=5, *, encode='onehot', strategy='quantile')
def q2():
    discretizer = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
    bins = discretizer.fit_transform(countries[['Pop_density']])

    # The 10 ordinal bins are labeled 0 to 9; bin 9 holds the values above the 90th percentile.
    return int((bins == 9).sum())

q2()

23
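
As a sanity check on which bin label corresponds to "above the 90th percentile", the sketch below runs the same discretizer settings on synthetic data (the integers 0 to 99, an assumption made purely for illustration) and compares the size of the top bin with a direct quantile count.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data: 100 evenly spaced values, one feature column.
values = np.arange(100, dtype=float).reshape(-1, 1)

discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
bins = discretizer.fit_transform(values)

# Ordinal labels run from 0 to n_bins - 1, so the top decile is bin 9.
top_bin_count = int((bins == 9).sum())

# Direct count of values at or above the 90th percentile.
above_p90 = int((values >= np.quantile(values, 0.9)).sum())

print(top_bin_count, above_p90)
```

On real data with ties or skew the two counts can differ slightly, so this is a check of the labeling convention rather than a replacement for the discretizer.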

## Question 3

If we encode the `Region` and `Climate` variables using _one-hot encoding_, how many new attributes would be created? Answer with a single scalar.

In [273]:
"""
Before scikit-learn 0.20, OneHotEncoder could not process string values directly, so nominal features
stored as strings had to be mapped to integers first. Newer versions accept strings.

pandas.get_dummies works the other way around: by default it only converts object and category columns
into a one-hot representation, unless `columns` is specified.
"""

def q3():    
    preprocessing = pd.get_dummies(countries[['Region', 'Climate']])
    #print(preprocessing)
    
    return preprocessing.shape[1]
    
q3()

18
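
For comparison, the sketch below counts one-hot columns with both `OneHotEncoder` and `pd.get_dummies` on a toy frame. The four rows are hypothetical, and `Climate` is stored as strings here by assumption; on the real data the count also depends on how missing values are treated.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: 3 distinct regions and 2 distinct climates, all strings.
toy = pd.DataFrame({
    "Region": ["OCEANIA", "BALTICS", "OCEANIA", "NEAR EAST"],
    "Climate": ["1", "2", "2", "1"],
})

# scikit-learn: one output column per category found in each input column.
encoder = OneHotEncoder()
n_sklearn = encoder.fit_transform(toy).shape[1]

# pandas: string columns are expanded into dummy columns by default.
n_pandas = pd.get_dummies(toy).shape[1]

print(n_sklearn, n_pandas)
```

Both approaches create one column per distinct category, so the total is the sum of the category counts of the encoded variables.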

## Question 4

Apply the following _pipeline_:

1. Fill missing values in the `int64` and `float64` variables with their respective medians.
2. Standardize these variables.

After fitting the _pipeline_ described above to the data (only on the variables of the specified types), apply the same _pipeline_ (or `ColumnTransformer`) to the data below. What is the value of the `Arable` variable after the _pipeline_? Answer with a single float rounded to three decimal places.

In [264]:
test_country = [
    'Test Country', 'NEAR EAST', -0.19032480757326514,
    -0.3232636124824411, -0.04421734470810142, -0.27528113360605316,
    0.13255850810281325, -0.8054845935643491, 1.0119784924248225,
    0.6189182532646624, 1.0074863283776458, 0.20239896852403538,
    -0.043678728558593366, -0.13929748680369286, 1.3163604645710438,
    -0.3699637766938669, -0.6149300604558857, -0.854369594993175,
    0.263445277972641, 0.5712416961268142
]
test_df = pd.DataFrame([test_country], columns=countries.columns)
test_df.head()

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Test Country,NEAR EAST,-0.190325,-0.323264,-0.044217,-0.275281,0.132559,-0.805485,1.011978,0.618918,1.007486,0.202399,-0.043679,-0.139297,1.31636,-0.369964,-0.61493,-0.85437,0.263445,0.571242


In [247]:
numeric_features = ['Population', 'Area', 'Pop_density', 'Coastline_ratio', 'Net_migration', 'Infant_mortality',
                    'GDP', 'Literacy', 'Phones_per_1000', 'Arable', 'Crops', 'Other', 'Birthrate', 'Deathrate',
                    'Agriculture', 'Industry', 'Service']
numeric_features

['Population',
 'Area',
 'Pop_density',
 'Coastline_ratio',
 'Net_migration',
 'Infant_mortality',
 'GDP',
 'Literacy',
 'Phones_per_1000',
 'Arable',
 'Crops',
 'Other',
 'Birthrate',
 'Deathrate',
 'Agriculture',
 'Industry',
 'Service']

In [270]:
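def q4():
    # Impute missing values with the median, then standardize to zero mean and unit variance.
    pipeline = Pipeline([("imputer", SimpleImputer(strategy='median')), ("standardScaler", StandardScaler())])
    pipeline.fit(countries[numeric_features])

    # Look up the position of `Arable` by name instead of hard-coding the column index.
    arable_index = numeric_features.index('Arable')
    return float(round(pipeline.transform(test_df[numeric_features])[0][arable_index], 3))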
    
q4()

-1.047
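
Question 4 also mentions `ColumnTransformer` as an alternative. A minimal sketch of that variant follows, on a hypothetical three-row frame (the column names `name` and `x` are made up for illustration): the numeric columns are selected by dtype, fitted once, and the fitted transformer is then reused on a new row.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: one text column and one numeric column with a gap.
train = pd.DataFrame({"name": ["a", "b", "c"], "x": [1.0, np.nan, 3.0]})

# Select only the int64 and float64 columns, as the question requires.
numeric_columns = train.select_dtypes(include=["int64", "float64"]).columns.tolist()

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Columns not listed are dropped by default (remainder="drop").
transformer = ColumnTransformer([("numeric", numeric_pipeline, numeric_columns)])
transformer.fit(train)

# Reuse the fitted medians, means, and standard deviations on unseen data.
new_row = pd.DataFrame({"name": ["d"], "x": [5.0]})
scaled_x = float(transformer.transform(new_row)[0][0])

print(round(scaled_x, 3))
```

Here the gap is imputed with the median 2.0, so the column becomes `[1, 2, 3]` with mean 2 and population standard deviation `sqrt(2/3)`, and the new value 5.0 is scaled with those fitted statistics rather than its own.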

## Question 5

Find the number of _outliers_ of the `Net_migration` variable according to the _boxplot_ method, that is, using the rule:

$$x \notin [Q1 - 1.5 \times \text{IQR}, Q3 + 1.5 \times \text{IQR}] \Rightarrow x \text{ is an outlier}$$

Count separately the outliers that fall in the lower group and those that fall in the upper group.

Should you remove from the analysis the observations considered _outliers_ by this method? Answer with a three-element tuple `(outliers_below, outliers_above, would_remove?)` ((int, int, bool)).

In [209]:
def q5():
    net_migration = countries['Net_migration'].dropna()

    # Local names are spelled out to avoid shadowing the answer functions q1 and q3.
    first_quartile, third_quartile = np.quantile(net_migration, [0.25, 0.75])
    iqr = third_quartile - first_quartile

    # Count observations outside the boxplot fences on each side.
    lower_outliers = int((net_migration < first_quartile - 1.5 * iqr).sum())
    upper_outliers = int((net_migration > third_quartile + 1.5 * iqr).sum())

    return (lower_outliers, upper_outliers, False)

q5()

(24, 26, False)
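
To make the boxplot rule concrete, here is a worked example on a small hypothetical array (the values are invented for illustration and are not taken from the data set).

```python
import numpy as np

# Hypothetical sample with one extreme value on each side.
values = np.array([-50, 1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

first_quartile, third_quartile = np.quantile(values, [0.25, 0.75])
iqr = third_quartile - first_quartile

# Boxplot fences: anything outside [lower_fence, upper_fence] is flagged.
lower_fence = first_quartile - 1.5 * iqr
upper_fence = third_quartile + 1.5 * iqr

outliers_below = int((values < lower_fence).sum())
outliers_above = int((values > upper_fence).sum())

print(lower_fence, upper_fence, outliers_below, outliers_above)
```

With these numbers the quartiles are 2.5 and 7.5, the IQR is 5, and the fences are -5 and 15, so only -50 and 100 are flagged. Whether flagged points should be dropped is a separate judgment: they may be genuine observations rather than errors.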

## Question 6

For questions 6 and 7, use the `fetch_20newsgroups` function from `sklearn.datasets`.

Load the `newsgroup` data set with the following categories:

```
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
```


Apply `CountVectorizer` to the `newsgroup` _data set_ and find how many times the word _phone_ appears in the corpus. Answer with a single scalar.

In [211]:
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [228]:
def q6():
    countVectorizer = CountVectorizer()
    data = countVectorizer.fit_transform(newsgroup.data)
    
    return int(data[:, countVectorizer.vocabulary_['phone']].sum())
    
q6()    

213
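
The column-sum approach can be checked by hand on a tiny corpus. The three sentences below are hypothetical; the sketch relies only on `CountVectorizer` defaults (lowercasing and word-level tokens).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "phone" appears once in the first text and twice in the second.
corpus = [
    "My phone rang twice",
    "Phone the phone company",
    "No calls today",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each token to its column index in the count matrix.
phone_column = vectorizer.vocabulary_["phone"]
phone_total = int(counts[:, phone_column].sum())

print(phone_total)
```

Because text is lowercased by default, `Phone` and `phone` land in the same column, which is why the total is 3 rather than 2.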

## Question 7

Apply `TfidfVectorizer` to the `newsgroup` _data set_ and find the TF-IDF of the word _phone_. Answer with a single scalar rounded to three decimal places.

In [234]:
def q7():
    tfidf = TfidfVectorizer()
    data = tfidf.fit_transform(newsgroup.data)
    
    return float(data[:, tfidf.vocabulary_['phone']].sum().round(3))

q7()

8.888
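
The answer above aggregates the per-document TF-IDF weights of _phone_ by summing its column. The sketch below shows that aggregation on a hypothetical three-document corpus, using `TfidfVectorizer` defaults (smoothed IDF and L2-normalized rows); summing is one possible aggregation, not the only one.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "phone" and "bike" each occur in two of the three documents.
corpus = [
    "phone phone",
    "bike engine",
    "phone bike",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(corpus)  # rows are L2-normalized by default

phone_column = vectorizer.vocabulary_["phone"]
per_document = weights[:, phone_column].toarray().ravel()

# Sum of the term's weight over all documents, mirroring the approach in q7.
phone_total = float(per_document.sum())

print(per_document, round(phone_total, 3))
```

The first document contains only _phone_, so its normalized weight is 1. The second contains no _phone_, so its weight is 0. In the third, _phone_ and _bike_ have the same document frequency and therefore equal weights of `1/sqrt(2)` after normalization.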