# Challenge 6

In this challenge, we will practice _feature engineering_, one of the most important and labor-intensive processes in ML. We will use the [Countries of the world](https://www.kaggle.com/fernandol/countries-of-the-world) _data set_, which contains data on 227 countries of the world, including population size, area, immigration, and production sectors.

> Note: Please do not rename the answer functions.

## General _setup_

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    OneHotEncoder, Binarizer, KBinsDiscretizer,
    MinMaxScaler, StandardScaler, PolynomialFeatures
)
from sklearn.compose import (make_column_selector, ColumnTransformer)
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer, TfidfVectorizer) 


In [2]:
# Some matplotlib settings.
%matplotlib inline

from IPython.core.pylabtools import figsize


figsize(12, 8)

sns.set()

In [3]:
countries = pd.read_csv("countries.csv")

In [4]:
new_column_names = [
    "Country", "Region", "Population", "Area", "Pop_density", "Coastline_ratio",
    "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
    "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
    "Industry", "Service"
]

countries.columns = new_column_names

countries.head(5)

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


## Notes

This _data set_ still needs some initial adjustments. First, note that the numeric variables use a comma as the decimal separator and are encoded as strings. Fix this before continuing: convert these variables to proper numeric types.

In addition, the `Country` and `Region` variables have extra spaces at the beginning and end of the string. You can use the `str.strip()` method to remove them.
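
As an alternative to converting the columns after loading, `pd.read_csv` accepts a `decimal` argument that parses decimal commas at read time. The sketch below uses a tiny made-up CSV string (the `raw` text is illustrative, not the real file) and then strips the padded country names:

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the file's decimal commas and padded names.
raw = 'Country,Pop_density\n"Afghanistan ","48,0"\n"Albania ","124,6"\n'

# decimal="," tells the parser to read "48,0" as the float 48.0.
df = pd.read_csv(io.StringIO(raw), decimal=",")

# Remove leading and trailing spaces from the text column.
df["Country"] = df["Country"].str.strip()

print(df.dtypes)
print(df)
```

With this approach the numeric columns arrive as `float64` directly, so the string replacement step is not needed.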

## Start your analysis from here

In [5]:
# Adjusting numeric columns
str_to_num_columns = [
    'Pop_density', 'Coastline_ratio', 'Net_migration', 'Infant_mortality', 'Literacy',
    'Phones_per_1000', 'Arable', 'Crops', 'Other', 'Birthrate', 'Deathrate',
    'Agriculture', 'Industry', 'Service', 'Climate'
]

# Convert only once: skip if the columns were already converted in a previous run.
if countries['Pop_density'].dtype != 'float64':
    countries[str_to_num_columns] = countries[str_to_num_columns].apply(
        lambda column: pd.to_numeric(column.str.replace(',', '.'))
    )
# (countries.dtypes != "object").reset_index()
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
Country             227 non-null object
Region              227 non-null object
Population          227 non-null int64
Area                227 non-null int64
Pop_density         227 non-null float64
Coastline_ratio     227 non-null float64
Net_migration       224 non-null float64
Infant_mortality    224 non-null float64
GDP                 226 non-null float64
Literacy            209 non-null float64
Phones_per_1000     223 non-null float64
Arable              225 non-null float64
Crops               225 non-null float64
Other               225 non-null float64
Climate             205 non-null float64
Birthrate           224 non-null float64
Deathrate           223 non-null float64
Agriculture         212 non-null float64
Industry            211 non-null float64
Service             212 non-null float64
dtypes: float64(16), int64(2), object(2)
memory usage: 35.5+ KB


In [27]:
# Adjusting the Country and Region columns

countries[['Country', 'Region']] = countries[['Country', 'Region']].apply(lambda column: column.str.strip()) 

countries.head()

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1.0,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3.0,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1.0,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2.0,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3.0,8.71,6.25,,,


## Question 1

Which regions (variable `Region`) are present in the _data set_? Return a list of the unique regions, with leading and trailing spaces removed (but keep punctuation: period, hyphen, etc.) and sorted alphabetically.

In [7]:
def q1():
    result_1 = sorted(list(countries['Region'].value_counts().index))
    return result_1
q1()

['ASIA (EX. NEAR EAST)',
 'BALTICS',
 'C.W. OF IND. STATES',
 'EASTERN EUROPE',
 'LATIN AMER. & CARIB',
 'NEAR EAST',
 'NORTHERN AFRICA',
 'NORTHERN AMERICA',
 'OCEANIA',
 'SUB-SAHARAN AFRICA',
 'WESTERN EUROPE']

## Question 2

Discretizing the `Pop_density` variable into 10 intervals with `KBinsDiscretizer`, using `ordinal` encoding and the `quantile` strategy, how many countries are above the 90th percentile? Answer with a single integer scalar.

In [10]:
def q2():
    discretizer = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
    # Bin codes run from 0 to 9; code 9 is the top decile (above the 90th percentile).
    # Working on the transformed array avoids assigning into a slice of `countries`.
    pop_density_bins = discretizer.fit_transform(countries[['Pop_density']])
    result_2 = (pop_density_bins == 9).sum()
    return int(result_2)
q2()

23
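
As a sanity check on what the top bin means, the sketch below compares the last `KBinsDiscretizer` bin with a direct count of values above the 90th percentile. It uses made-up data (the integers 0 to 99), not the countries table, so the names `values`, `top_bin_count`, and `above_p90_count` are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Toy data: 100 evenly spaced values, so each decile holds 10 of them.
values = pd.DataFrame({"x": np.arange(100, dtype=float)})

discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
bins = discretizer.fit_transform(values[["x"]])

# Number of observations assigned to the last bin (code 9).
top_bin_count = int((bins == 9).sum())

# Number of observations strictly above the 90th percentile.
above_p90_count = int((values["x"] > values["x"].quantile(0.9)).sum())

print(top_bin_count, above_p90_count)
```

The two counts agree on this toy data. On real data they can differ slightly when observations sit exactly on the bin edge, because the last bin includes its lower edge.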

## Question 3

If we encode the `Region` and `Climate` variables using _one-hot encoding_, how many new attributes would be created? Answer with a single scalar.

In [11]:
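def q3():
    # nunique() ignores missing values. Climate has 22 missing entries (205 non-null of 227),
    # so an encoder that treats NaN as its own category would create one more attribute.
    unique_values = countries[['Region', 'Climate']].nunique()
    result_3 = unique_values.sum()
    return int(result_3)
q3()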

17
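
Whether missing values count as a category changes this total. The sketch below shows the difference on a made-up four-row frame (the `toy` data is an assumption for illustration), using the `dropna` argument of `DataFrame.nunique`:

```python
import numpy as np
import pandas as pd

# Toy frame: Region has 2 categories; Climate has 2 categories plus one missing value.
toy = pd.DataFrame({
    "Region": ["A", "B", "A", "B"],
    "Climate": [1.0, np.nan, 2.0, 1.0],
})

# Default behaviour: NaN is not counted as a category.
without_nan = int(toy[["Region", "Climate"]].nunique().sum())

# dropna=False counts NaN as one extra category per column that contains it.
with_nan = int(toy[["Region", "Climate"]].nunique(dropna=False).sum())

print(without_nan, with_nan)
```

Applied to the countries table, the second variant would count the missing `Climate` values as an additional category.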

## Question 4

Apply the following _pipeline_:

1. Fill the missing values of the `int64` and `float64` variables with their respective medians.
2. Standardize these variables.

After applying the _pipeline_ described above to the data (only to the variables of the specified types), apply the same _pipeline_ (or `ColumnTransformer`) to the data below. What is the value of the `Arable` variable after the _pipeline_? Answer with a single float rounded to three decimal places.

In [12]:


test_country = [
    'Test Country', 'NEAR EAST', -0.19032480757326514,
    -0.3232636124824411, -0.04421734470810142, -0.27528113360605316,
    0.13255850810281325, -0.8054845935643491, 1.0119784924248225,
    0.6189182532646624, 1.0074863283776458, 0.20239896852403538,
    -0.043678728558593366, -0.13929748680369286, 1.3163604645710438,
    -0.3699637766938669, -0.6149300604558857, -0.854369594993175,
    0.263445277972641, 0.5712416961268142
]

len(test_country)

20

In [23]:
# Using ColumnTransformer

# Note: the transformers in a ColumnTransformer run in parallel, not in sequence.
# The output concatenates the imputed (unscaled) columns with the scaled (unimputed)
# columns, so the scaler here is fitted on data that still contains NaNs.
col_transformer = ColumnTransformer([
    ("imputer", SimpleImputer(strategy="median"), make_column_selector(dtype_include=np.number)),
    ("standardizer", StandardScaler(), make_column_selector(dtype_include=np.number))
])
pipeline_CT = col_transformer.fit(countries)  # fit on the "training" data set
test_country_df = pd.DataFrame([test_country], columns=countries.columns)
arable_loc = test_country_df.columns.get_loc('Arable')
result_4 = pipeline_CT.transform(test_country_df)  # 18 imputed columns followed by 18 scaled columns
result_4[0][18:]  # keep only the scaled columns

array([-0.24432501, -0.33489095, -0.22884735, -0.29726002,  0.0193577 ,
       -1.0283662 , -0.96628361, -4.17888867, -1.03329473, -1.04483156,
       -0.55231618, -5.07780086, -1.17912718, -2.01625005, -1.97963876,
       -6.86380919, -0.13966246,  0.0360151 ])

In [14]:
# Using Pipeline

countries_num = countries.select_dtypes(exclude='object')  # keep only the numeric (non-object) columns
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standardizer", StandardScaler())
])  # numeric pipeline: impute first, then standardize
pipeline_fit = num_pipeline.fit(countries_num)
arable_loc = countries_num.columns.get_loc('Arable')  # index of Arable among the numeric columns
test_country_df = pd.DataFrame([test_country], columns=countries.columns)
test_country_transformed = pipeline_fit.transform(test_country_df.select_dtypes(exclude='object'))
test_country_transformed[0]

array([-0.24432501, -0.33489095, -0.22884735, -0.29726002,  0.01959086,
       -1.02861728, -0.96623348, -4.35427242, -1.03720972, -1.04685743,
       -0.55058149, -5.10112169, -1.21812201, -2.02455164, -1.99092137,
       -7.04915046, -0.13915481,  0.03490335])

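The two approaches give slightly different results. In the `ColumnTransformer` version the imputer and the scaler run side by side on the original columns, so `StandardScaler` is fitted with the missing values ignored. In the `Pipeline` version the scaler is fitted after median imputation, which slightly changes each column's mean and standard deviation. The first four columns, which have no missing values, match exactly. The `Pipeline` version follows the order described in the question, so it is the one used in the answer.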
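
The impute-then-scale order can also be kept inside a `ColumnTransformer` by nesting a `Pipeline` in it as a single transformer. A minimal sketch on made-up data follows; the `toy` frame and the `make_num_pipeline` helper are illustrative names, not part of the challenge:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame with one text column and two numeric columns containing NaNs.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x": [1.0, 2.0, np.nan, 10.0],
    "y": [5.0, np.nan, 7.0, 9.0],
})


def make_num_pipeline():
    """Hypothetical helper: median imputation followed by standardization."""
    return Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("standardizer", StandardScaler()),
    ])


# One transformer entry whose steps run in sequence on the numeric columns.
nested = ColumnTransformer([
    ("num", make_num_pipeline(), make_column_selector(dtype_include=np.number)),
])
nested_out = nested.fit_transform(toy)

# The same pipeline applied directly to the numeric columns.
plain_out = make_num_pipeline().fit_transform(toy[["x", "y"]])

print(np.allclose(nested_out, plain_out))
```

Because the scaler now sees the imputed values in both cases, the two outputs coincide, and the non-numeric column is dropped by the default `remainder="drop"`.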

In [68]:
def q4():
    countries_num = countries.select_dtypes(exclude='object')  # keep only the numeric (non-object) columns
    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("standardizer", StandardScaler())
    ])  # numeric pipeline: impute first, then standardize
    pipeline_fit = num_pipeline.fit(countries_num)
    arable_loc = countries_num.columns.get_loc('Arable')  # index of Arable among the numeric columns
    test_country_df = pd.DataFrame([test_country], columns=countries.columns)
    test_country_transformed = pipeline_fit.transform(test_country_df.select_dtypes(exclude='object'))[0][arable_loc]
    result_4 = round(test_country_transformed, 3)
    return float(result_4)
q4()

-1.047

## Question 5

Find the number of _outliers_ of the `Net_migration` variable according to the _boxplot_ method, that is, using the rule:

$$x \notin [Q1 - 1.5 \times \text{IQR}, Q3 + 1.5 \times \text{IQR}] \Rightarrow x \text{ is an outlier}$$

counting separately those in the lower group and those in the upper group.

Should you remove the observations flagged as _outliers_ by this method from the analysis? Answer with a three-element tuple `(outliers_below, outliers_above, remove?)` ((int, int, bool)).

In [35]:
net_migration = countries['Net_migration']
quartiles = net_migration.quantile(q=[0.25, 0.50, 0.75]).values
iqr = quartiles[2] - quartiles[0]
migration_outliers = [
    net_migration[net_migration < (quartiles[0] - 1.5 * iqr)],  # below the lower fence
    net_migration[net_migration > (quartiles[2] + 1.5 * iqr)]   # above the upper fence
]
qt_outliers = [outliers.size for outliers in migration_outliers]
remove = False
sum(qt_outliers)/net_migration.size  # share of outliers (not displayed: only the last expression is shown)
countries[countries['Net_migration'].isin(migration_outliers[0])].head()


Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3.0,15.11,5.22,0.232,0.188,0.579
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2.0,22.46,3.27,,,
7,Antigua & Barbuda,LATIN AMER. & CARIB,69108,443,156.0,34.54,-6.15,19.46,11000.0,89.0,549.9,18.18,4.55,77.27,2.0,16.93,5.37,0.038,0.22,0.743
9,Armenia,C.W. OF IND. STATES,2976372,29800,99.9,0.0,-6.47,23.28,3500.0,98.6,195.7,17.55,2.3,80.15,4.0,12.07,8.23,0.239,0.343,0.418
13,Azerbaijan,C.W. OF IND. STATES,7961619,86600,91.9,0.0,-4.9,81.74,3400.0,97.0,137.1,19.63,2.71,77.66,1.0,20.74,9.75,0.141,0.457,0.402


About 22% of the observations (50 of 227) are outliers by the boxplot method. However, each observation represents a different country, and these extreme values can be very useful for many analyses. For example, we might want to understand why a specific country has such a large negative net migration and find its cause (economic or social problems, war, etc.). For that reason, I would not remove them.

In [17]:
def q5():
    net_migration =countries['Net_migration']
    quartiles = net_migration.quantile(q = [0.25, 0.50, 0.75]).values
    iqr = quartiles[2] - quartiles[0]
    migration_outliers = [
        net_migration[net_migration < (quartiles[0] - 1.5 * iqr)],  # below the lower fence
        net_migration[net_migration > (quartiles[2] + 1.5 * iqr)]   # above the upper fence
    ]
    qt_outliers = [outliers.size for outliers in migration_outliers]
    remove = False
    qt_outliers.append(remove)
    result_5 = qt_outliers
    return tuple(result_5)
q5()

(24, 26, False)
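
To make the boxplot rule concrete, here is a small worked example on made-up numbers (the `values` series is illustrative, not taken from the data set). With these eleven values, Q1 = 2.5 and Q3 = 7.5, so IQR = 5 and the fences are -5 and 15:

```python
import pandas as pd

# Toy series with one obviously low and one obviously high value.
values = pd.Series([-50, 1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# First and third quartiles, then the interquartile range.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Boxplot fences: anything outside [lower_fence, upper_fence] is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers_below = int((values < lower_fence).sum())
outliers_above = int((values > upper_fence).sum())

print(lower_fence, upper_fence, outliers_below, outliers_above)
```

Only -50 falls below the lower fence and only 100 falls above the upper one, so the counts are one on each side.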

## Question 6
For questions 6 and 7, use the `fetch_20newsgroups` function from `sklearn`'s test data sets.

Consider loading the following categories and the `newsgroups` data set:

```
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
```


Apply `CountVectorizer` to the `newsgroups` _data set_ and find the number of times the word _phone_ appears in the corpus. Answer with a single scalar.

In [41]:
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
# Returns a dict-like Bunch with the keys: data (list of the n_samples raw posts),
# target (numpy array of length n_samples with each post's category index),
# target_names, filenames, and DESCR. For instance:
print(newsgroup.data[3])


From: todd@psgi.UUCP (Todd Doolittle)
Subject: Re:  Motorcycle Courier (Summer Job)
Distribution: world
Organization: Not an Organization
Lines: 37

In article <1poj23INN9k@west.West.Sun.COM> gaijin@ale.Japan.Sun.COM (John Little - Nihon Sun Repair Depot) writes:
>In article <8108.97.uupcb@compdyn.questor.org> \
>ryan_cousineau@compdyn.questor.org (Ryan Cousineau) writes:
>%
>% I think I've found the ultimate summer job: It's dangerous, involves
>% motorcycles, requires high speeds in traffic, and it pays well.
>% 
>% So my question is as follows: Has anyone here done this sort of work?
>% What was your experience?
>% 
[Stuff deleted]
>   Get a -good- "AtoZ" type indexed streetmap for all of the areas  you're
>   likely to work.   Always carry  plenty of black-plastic  bin liners  to

Check with the local fire department.  My buddy is a firefighter and they
have these small map books which are Amazing!  They are compact, easy to
use (no folding).  They even have a cross reference secti

In [58]:
vectorizer = CountVectorizer()
newsgroup_vectorized = vectorizer.fit_transform(newsgroup.data)  # sparse document-term count matrix
phone_index = vectorizer.vocabulary_.get('phone')  # column index of the term "phone"
count_phone_words = [sample[phone_index] for sample in newsgroup_vectorized.toarray()]  # "phone" count per document
qty_phone_words = sum(count_phone_words)
# Caution: vocabulary_.keys() is in first-seen order, not in column order, so the
# labels below do not line up with the matrix columns. For aligned labels, use
# sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get).
words_df = pd.DataFrame(newsgroup_vectorized.toarray(), columns = list(vectorizer.vocabulary_.keys()))
words_df.head()

Unnamed: 0,from,rubin,cis,ohio,state,edu,daniel,subject,re,what,...,lddc,docklands,litigious,securities,5656,4902,po5,163122,20454,cbfsb
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
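
When labelling the columns of a document-term matrix, the labels need to follow the column indices stored in `vocabulary_`, not the order of its keys. A minimal sketch on a made-up two-document corpus (the `corpus` text and variable names are illustrative):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: "phone" appears once in the first document and twice in the second.
corpus = ["the phone rang", "phone the phone"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index; sorting the terms by that
# index yields labels aligned with the matrix columns.
aligned_terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
words_df = pd.DataFrame(counts.toarray(), columns=aligned_terms)

# Total occurrences of "phone" across the corpus, read from the labelled frame.
phone_total = int(words_df["phone"].sum())

print(words_df)
print(phone_total)
```

Sorting by the stored index works regardless of the scikit-learn version, which is why it is used here instead of a feature-name accessor.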


In [19]:
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)

In [20]:
def q6():
    vectorizer = CountVectorizer()
    newsgroup_vectorized = vectorizer.fit_transform(newsgroup.data)
    phone_index = vectorizer.vocabulary_.get('phone')  # column index of the term "phone"
    # Sum the "phone" column directly on the sparse matrix instead of densifying it.
    result_6 = newsgroup_vectorized[:, phone_index].sum()
    return int(result_6)
q6()

213

## Question 7

Apply `TfidfVectorizer` to the `newsgroups` _data set_ and find the TF-IDF of the word _phone_. Answer with a single scalar rounded to three decimal places.

For each document, `TfidfVectorizer` produces a vector with the normalized $\text{tf-idf}$ value of each term in that document (L2-normalized by default). The answer below sums the _phone_ values over all documents.

In [22]:
def q7():
    tfidf_vectorizer = TfidfVectorizer()
    newsgroups_tfidf_vect = tfidf_vectorizer.fit_transform(newsgroup.data)
    phone_index = tfidf_vectorizer.vocabulary_.get('phone')  # column index of the term "phone"
    # Sum the "phone" column over all documents without densifying the matrix.
    phone_tfidf_total = newsgroups_tfidf_vect[:, phone_index].sum()
    result_7 = round(float(phone_tfidf_total), 3)
    return result_7
q7()

8.888
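
The per-document normalization can be checked on a small made-up corpus (the `corpus` text is illustrative). With the default settings, each row of the TF-IDF matrix has unit L2 norm, and the total for a term is the sum of its column:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "phone" occurs in the first two documents only.
corpus = ["the phone rang", "phone the phone", "nobody called"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Each document vector is L2-normalized by default (norm="l2").
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

# Sum of the "phone" TF-IDF values over all documents.
phone_index = vectorizer.vocabulary_["phone"]
phone_total = float(tfidf[:, phone_index].sum())

print(row_norms)
print(round(phone_total, 3))
```

Since every row has norm 1, a term's value in a document never exceeds 1, and the summed score is bounded by the number of documents that contain the term.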