![Notebook header](cabecalho_notebook.png)

# PCA - Task 01: *HAR* with PCA

We will work with the dataset from the in-class demonstration, but look more closely at how the decision tree performs as the number of principal components varies.

In [8]:
import pandas as pd

from sklearn.tree import DecisionTreeClassifier

from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV

filename_features = "Dados/UCI HAR Dataset/features.txt"
filename_labels = "Dados/UCI HAR Dataset/activity_labels.txt"

filename_subtrain = "Dados/UCI HAR Dataset/train/subject_train.txt"
filename_xtrain = "Dados/UCI HAR Dataset/train/X_train.txt"
filename_ytrain = "Dados/UCI HAR Dataset/train/y_train.txt"

filename_subtest = "Dados/UCI HAR Dataset/test/subject_test.txt"
filename_xtest = "Dados/UCI HAR Dataset/test/X_test.txt"
filename_ytest = "Dados/UCI HAR Dataset/test/y_test.txt"

# sep="#" does not split these lines, so each line is read whole into a single column
features = pd.read_csv(filename_features, header=None, names=['nome_var'], sep="#")
features = features.squeeze('columns')
# sep=r"\s+" replaces the deprecated delim_whitespace=True
labels = pd.read_csv(filename_labels, sep=r"\s+", header=None, names=['cod_label', 'label'])

# Training data: subject ids, features and activity codes
subject_train = pd.read_csv(filename_subtrain, header=None, names=['subject_id'])
subject_train = subject_train.squeeze('columns')
X_train = pd.read_csv(filename_xtrain, sep=r"\s+", header=None, names=features.tolist())
y_train = pd.read_csv(filename_ytrain, header=None, names=['cod_label'])

# Test data, read the same way
subject_test = pd.read_csv(filename_subtest, header=None, names=['subject_id'])
subject_test = subject_test.squeeze('columns')
X_test = pd.read_csv(filename_xtest, sep=r"\s+", header=None, names=features.tolist())
y_test = pd.read_csv(filename_ytest, header=None, names=['cod_label'])

In [9]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, random_state=23)

## Decision tree

Fit a decision tree on all variables, using ```ccp_alpha=0.001```. Evaluate the accuracy on the training and test sets. Evaluate the processing time.

In [86]:
%%time

clf = DecisionTreeClassifier(ccp_alpha=0.001, random_state=23).fit(X_train, y_train)

CPU times: total: 4.05 s
Wall time: 5.58 s
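The `%%time` magic only works inside IPython/Jupyter. Outside a notebook, a similar measurement can be sketched with `time.perf_counter` (wall clock) and `time.process_time` (CPU). The helper name `timed_fit` and the synthetic data below are assumptions for illustration, not part of the original notebook.

```python
import time

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def timed_fit(model, X, y):
    """Fit `model` and return (fitted model, wall seconds, CPU seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    model.fit(X, y)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return model, wall_seconds, cpu_seconds


# Synthetic stand-in for the HAR data (hypothetical, for illustration only)
rng = np.random.default_rng(23)
X_demo = rng.normal(size=(300, 20))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

tree, wall_s, cpu_s = timed_fit(
    DecisionTreeClassifier(ccp_alpha=0.001, random_state=23), X_demo, y_demo
)
print(f"Wall time: {wall_s:.4f} s, CPU time: {cpu_s:.4f} s")
```

Timings from a single run are noisy, so repeating the fit a few times and reporting the median gives a more stable comparison.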


In [79]:
def avalia(model, conjunto_X: list, conjunto_y: list, nomes=('treino', 'validação', 'teste'), exibir=True):
    """Return the accuracy of `model` on each (X, y) pair, keyed by the names in `nomes`."""
    resultados = {}
    for nome, X, y in zip(nomes, conjunto_X, conjunto_y):
        # For classifiers, `score` returns the mean accuracy
        acuracia = model.score(X, y)
        resultados[nome] = acuracia
        if exibir:
            print(f'Acurácia na base de {nome}: {acuracia*100:.1f}')
    return resultados

In [87]:
resultado = avalia(clf,[X_train, X_valid, X_test], [y_train, y_valid, y_test])

Acurácia na base de treino: 97.8
Acurácia na base de validação: 94.3
Acurácia na base de teste: 86.6


## Tree with PCA

Run a principal component analysis on the original variables. Keep only one component. Fit a decision tree with this component as the explanatory variable.

- Evaluate the accuracy on the training and test sets
- Evaluate the processing time
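As an alternative to transforming each dataset by hand, the two steps can be chained in a scikit-learn `Pipeline`, which fits the PCA on the training data only and reuses that projection when scoring. The sketch below runs on synthetic data; `X_demo`, `y_demo` and the split sizes are illustrative assumptions, not the HAR data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HAR data (hypothetical, for illustration only)
rng = np.random.default_rng(23)
X_demo = rng.normal(size=(400, 30))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_fit, X_new = X_demo[:300], X_demo[300:]
y_fit, y_new = y_demo[:300], y_demo[300:]

# PCA with one component followed by the same tree settings used in this notebook
pipe = Pipeline([
    ("pca", PCA(n_components=1)),
    ("tree", DecisionTreeClassifier(ccp_alpha=0.001, random_state=23)),
])
pipe.fit(X_fit, y_fit)  # the PCA is fitted on X_fit only

acc_fit = pipe.score(X_fit, y_fit)
acc_new = pipe.score(X_new, y_new)  # X_new is projected with the fitted PCA
print(f"fit: {acc_fit:.3f}, new: {acc_new:.3f}")
```

Keeping both steps in one object also makes it harder to accidentally fit the PCA on validation or test data.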

In [81]:
%%time
prcomp = PCA().fit(X_train)

pc_treino = prcomp.transform(X_train)
pc_valida = prcomp.transform(X_valid)
pc_teste  = prcomp.transform(X_test)

pc_treino.shape

CPU times: total: 1.5 s
Wall time: 1.2 s


(5514, 561)

In [82]:
n=1

colunas = ['cp'+str(x+1) for x in list(range(n))]

pc_train = pd.DataFrame(pc_treino[:,:n], columns = colunas)
pc_valid = pd.DataFrame(pc_valida[:,:n], columns = colunas)
pc_test  = pd.DataFrame(pc_teste[:,:n], columns = colunas)

pc_train.head()

Unnamed: 0,cp1
0,-6.121259
1,-5.184451
2,-2.47383
3,-5.826141
4,-5.799838


In [88]:
%%time
clf = DecisionTreeClassifier(ccp_alpha=0.001, random_state=23).fit(pc_train, y_train)

CPU times: total: 62.5 ms
Wall time: 69.5 ms


In [89]:
acuracia = avalia(clf, [pc_train, pc_valid, pc_test], [y_train, y_valid, y_test])

Acurácia na base de treino: 50.2
Acurácia na base de validação: 49.1
Acurácia na base de teste: 45.4


Accuracy dropped sharply compared with the first tree, which was trained on all 561 original variables: from 86.6% to 45.4% on the test set.
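One way to see why a single component may not be enough is to check how much of the total variance each component retains, using the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below uses synthetic data (an assumption for illustration); on the HAR data the same attribute could be read from the `prcomp` object fitted above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 columns with decreasing spread (hypothetical, for illustration only)
rng = np.random.default_rng(23)
X_demo = rng.normal(size=(500, 10)) * np.linspace(5.0, 0.5, 10)

pca_demo = PCA().fit(X_demo)

# Cumulative share of variance retained by the first k components
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%
n_for_90 = int(np.searchsorted(cum_var, 0.90) + 1)

print(f"First component alone: {cum_var[0]:.1%} of the variance")
print(f"Components needed for 90%: {n_for_90}")
```

Explained variance is only a proxy: a component with little variance can still carry class information, so the accuracy comparison remains the deciding test.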

## Testing the number of components

Based on the code above, test the classification tree with at least the following numbers of components: ```[1, 2, 5, 10, 50]```. For each one, evaluate:

- Accuracy on the training and test sets
- Processing time


In [95]:
%%time
comp = [1, 2, 5, 10, 25, 50, 100, 150, 200]
bases = {}

for n in comp:
    pca = PCA(n_components=n)
    pca.fit(X_train)
    
    pca_treino = pca.transform(X_train)
    pca_valida = pca.transform(X_valid)
    pca_teste = pca.transform(X_test)
    
    colunas = [f'cp{i+1}' for i in range(n)]
    
    bases[f'pca_train_{n}'] = pd.DataFrame(pca_treino, columns=colunas)
    bases[f'pca_valid_{n}'] = pd.DataFrame(pca_valida, columns=colunas)
    bases[f'pca_test_{n}'] = pd.DataFrame(pca_teste, columns=colunas)

CPU times: total: 8.12 s
Wall time: 5.26 s
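The loop above refits the PCA for every value of `n`. Because principal components are ordered and nested, a cheaper alternative is to fit once with all components and slice the first `n` columns. The sketch below checks this equivalence on synthetic data with `svd_solver="full"`; both the data and the explicit solver choice are assumptions for illustration. With an approximate solver, the two results would be expected to agree only up to small numerical differences.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with a distinct variance per column (hypothetical, for illustration only)
rng = np.random.default_rng(23)
X_demo = rng.normal(size=(200, 12)) * np.arange(1, 13)

# Fit once with all components and project the data
pca_full = PCA(svd_solver="full").fit(X_demo)
scores_full = pca_full.transform(X_demo)

# Fit again keeping only the first n components
n = 3
scores_n = PCA(n_components=n, svd_solver="full").fit_transform(X_demo)

# The first n columns of the full projection match the truncated fit
same = np.allclose(scores_full[:, :n], scores_n)
print(same)
```

In the notebook this would mean slicing `pc_treino[:, :n]`, `pc_valida[:, :n]` and `pc_teste[:, :n]`, as was already done for `n=1`, instead of calling `PCA(n_components=n)` nine times.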


Preview of the datasets created:

In [96]:
for chave, df in bases.items():
    if 'pca_train' in chave:
        print(f'\n{chave}:')
        display(df.head())


pca_train_1:


Unnamed: 0,cp1
0,-6.121259
1,-5.184451
2,-2.47383
3,-5.826141
4,-5.799838



pca_train_2:


Unnamed: 0,cp1,cp2
0,-6.121259,2.296441
1,-5.184451,-0.71284
2,-2.47383,-2.994641
3,-5.826141,0.308928
4,-5.799838,1.32034



pca_train_5:


Unnamed: 0,cp1,cp2,cp3,cp4,cp5
0,-6.121259,2.296441,-2.146628,0.572642,0.106557
1,-5.184451,-0.71284,-1.38814,0.420117,0.625489
2,-2.47383,-2.994641,3.373997,0.373215,1.848003
3,-5.826141,0.308928,2.47289,0.398034,0.381323
4,-5.799838,1.32034,-1.73292,0.075929,-0.433482



pca_train_10:


Unnamed: 0,cp1,cp2,cp3,cp4,cp5,cp6,cp7,cp8,cp9,cp10
0,-6.121259,2.296441,-2.146628,0.572643,0.106547,-1.008291,-0.245372,-0.322101,-0.53181,0.074976
1,-5.184451,-0.71284,-1.38814,0.420123,0.625455,0.651474,-0.501915,0.303399,0.383313,0.180506
2,-2.47383,-2.994641,3.373997,0.373218,1.847942,1.265886,1.169202,-0.515627,-0.544169,-0.85813
3,-5.826141,0.308928,2.47289,0.398018,0.381349,0.204018,0.842278,0.231618,-0.716634,0.293706
4,-5.799838,1.32034,-1.73292,0.07593,-0.433478,-0.358014,-0.693638,0.5148,0.413125,0.248118



pca_train_25:


Unnamed: 0,cp1,cp2,cp3,cp4,cp5,cp6,cp7,cp8,cp9,cp10,...,cp16,cp17,cp18,cp19,cp20,cp21,cp22,cp23,cp24,cp25
0,-6.121259,2.296441,-2.146628,0.572644,0.106552,-1.008276,-0.24533,-0.322493,-0.53215,0.073746,...,-0.452761,0.270094,0.432482,-0.210396,-0.533992,-0.166537,-0.622673,0.120527,-0.169471,-0.175215
1,-5.184451,-0.71284,-1.38814,0.420122,0.62545,0.651448,-0.501937,0.303653,0.383501,0.181302,...,-0.292811,-0.37385,-0.342156,-0.097599,0.932677,0.008503,0.390933,0.204113,-0.217382,0.393345
2,-2.47383,-2.994641,3.373997,0.373222,1.847946,1.26587,1.169314,-0.516178,-0.544656,-0.861306,...,-0.251255,-0.289017,-0.17805,0.05102,-0.500123,-0.6527,1.022417,0.685469,-0.112285,-0.019873
3,-5.826141,0.308928,2.47289,0.39802,0.381351,0.204012,0.842363,0.231421,-0.716723,0.291934,...,-0.662334,0.977965,-0.299132,-0.066303,-0.251373,-0.357778,0.553538,0.805858,0.216428,-0.054046
4,-5.799838,1.32034,-1.73292,0.075927,-0.433481,-0.35801,-0.693687,0.515135,0.413586,0.2501,...,0.637861,0.205377,-0.200147,-0.108625,0.348087,-0.099287,-0.454627,-0.223439,0.253188,0.145836



pca_train_50:


Unnamed: 0,cp1,cp2,cp3,cp4,cp5,cp6,cp7,cp8,cp9,cp10,...,cp41,cp42,cp43,cp44,cp45,cp46,cp47,cp48,cp49,cp50
0,-6.121259,2.296441,-2.146628,0.572644,0.106552,-1.008276,-0.24533,-0.322491,-0.532149,0.073749,...,0.291897,0.168519,0.24472,0.053447,0.259841,-0.25398,-0.065936,0.15766,0.0718,0.028187
1,-5.184451,-0.71284,-1.38814,0.420122,0.62545,0.651448,-0.501937,0.303652,0.3835,0.181302,...,-0.373057,-0.205683,-0.18062,-0.250283,0.25016,0.291714,0.193355,-0.306728,-0.295616,-0.085696
2,-2.47383,-2.994641,3.373997,0.373222,1.847946,1.26587,1.169315,-0.516178,-0.544651,-0.861301,...,0.325825,0.427019,-0.330404,0.2865,1.224233,-0.115947,1.148312,0.632758,-0.845281,-0.462008
3,-5.826141,0.308928,2.47289,0.39802,0.381351,0.204011,0.842363,0.23142,-0.716721,0.291936,...,-0.50336,-0.105864,-0.155609,-0.074636,-0.359203,-0.082036,-0.08903,-0.613513,-0.083448,0.081467
4,-5.799838,1.32034,-1.73292,0.075927,-0.433481,-0.35801,-0.693687,0.515134,0.413586,0.250095,...,-0.338614,-0.234646,0.354221,-0.275585,0.105639,0.149685,0.06993,-0.066378,0.193072,0.026358



pca_train_100:


Unnamed: 0,cp1,cp2,cp3,cp4,cp5,cp6,cp7,cp8,cp9,cp10,...,cp91,cp92,cp93,cp94,cp95,cp96,cp97,cp98,cp99,cp100
0,-6.121259,2.296441,-2.146628,0.572644,0.106552,-1.008276,-0.24533,-0.322491,-0.532149,0.073749,...,0.181175,0.157361,0.175665,0.172103,0.004488,0.050417,-0.000719,0.130851,-0.016783,0.116808
1,-5.184451,-0.71284,-1.38814,0.420122,0.62545,0.651448,-0.501937,0.303652,0.3835,0.181302,...,-0.152057,-0.072826,-0.230138,-0.146181,-0.121885,0.180631,0.087919,0.130696,0.076996,-0.072317
2,-2.47383,-2.994641,3.373997,0.373222,1.847946,1.26587,1.169315,-0.516178,-0.544651,-0.861302,...,-0.00086,0.09523,0.418222,-0.204705,-0.241493,0.442552,-0.426342,0.057879,0.393264,-0.015882
3,-5.826141,0.308928,2.47289,0.39802,0.381351,0.204011,0.842363,0.23142,-0.716721,0.291936,...,-0.016883,0.129614,0.222267,-0.045908,0.131979,-0.085853,0.275211,0.142921,-0.232475,-0.354718
4,-5.799838,1.32034,-1.73292,0.075927,-0.433481,-0.35801,-0.693687,0.515134,0.413586,0.250095,...,-0.249886,0.084984,-0.089368,0.016256,0.13116,-0.109179,0.144886,-0.108467,0.000577,-0.279894



pca_train_150:


Unnamed: 0,cp1,cp2,cp3,cp4,cp5,cp6,cp7,cp8,cp9,cp10,...,cp141,cp142,cp143,cp144,cp145,cp146,cp147,cp148,cp149,cp150
0,-6.121259,2.296441,-2.146628,0.572644,0.106552,-1.008276,-0.24533,-0.322491,-0.532149,0.073749,...,0.005718,-0.08394,0.001358,0.073941,0.064888,-0.052149,0.051819,0.066455,0.042022,0.032921
1,-5.184451,-0.71284,-1.38814,0.420122,0.62545,0.651448,-0.501937,0.303652,0.3835,0.181302,...,0.026033,-0.056348,-0.077746,-0.060197,0.049168,-0.074898,-0.053926,-0.077322,-0.044596,0.01468
2,-2.47383,-2.994641,3.373997,0.373222,1.847946,1.26587,1.169315,-0.516178,-0.544651,-0.861301,...,0.026585,-0.146141,-0.106934,0.108006,-0.094004,-0.263653,-0.270327,-0.042242,0.178139,-0.148572
3,-5.826141,0.308928,2.47289,0.39802,0.381351,0.204011,0.842363,0.23142,-0.716721,0.291936,...,-0.092757,0.011628,0.076344,-0.005688,0.062772,-0.008777,-0.099493,0.058262,0.003344,0.030763
4,-5.799838,1.32034,-1.73292,0.075927,-0.433481,-0.35801,-0.693687,0.515134,0.413586,0.250095,...,-0.004354,0.015963,0.13438,-0.021854,0.009291,0.019472,-0.03431,0.09644,-0.133192,0.041774



pca_train_200:


Unnamed: 0,cp1,cp2,cp3,cp4,cp5,cp6,cp7,cp8,cp9,cp10,...,cp191,cp192,cp193,cp194,cp195,cp196,cp197,cp198,cp199,cp200
0,-6.121259,2.296441,-2.146628,0.572644,0.106552,-1.008276,-0.24533,-0.322491,-0.532149,0.073749,...,0.020338,0.015389,-0.014784,0.0423,-0.020025,0.009787,0.03534,-0.004165,-0.018027,-0.027548
1,-5.184451,-0.71284,-1.38814,0.420122,0.62545,0.651448,-0.501937,0.303652,0.3835,0.181302,...,-0.03627,0.033391,0.031197,-0.040088,0.075635,0.003599,-0.013064,0.013236,0.024967,0.014827
2,-2.47383,-2.994641,3.373997,0.373222,1.847946,1.26587,1.169315,-0.516178,-0.544651,-0.861301,...,0.136822,-0.031469,-0.027008,0.085444,0.096142,-0.025736,-0.078698,0.148236,0.152459,0.291425
3,-5.826141,0.308928,2.47289,0.39802,0.381351,0.204011,0.842363,0.23142,-0.716721,0.291936,...,0.006961,-0.03469,-0.022807,-0.024974,0.007982,-0.009884,0.015749,-0.033388,-0.032038,0.02463
4,-5.799838,1.32034,-1.73292,0.075927,-0.433481,-0.35801,-0.693687,0.515134,0.413586,0.250095,...,-0.016742,0.005988,-0.008124,-0.0115,-0.011013,-0.006866,-0.009906,-0.000341,0.060509,-0.008085


In [103]:
%%time
resultado_acc = {}
for n in comp:
    
    X_treino = bases[f'pca_train_{n}']
    X_valida = bases[f'pca_valid_{n}']
    X_teste = bases[f'pca_test_{n}']
    
    clf = DecisionTreeClassifier(ccp_alpha=0.001, random_state=23).fit(X_treino, y_train)
    
    avaliacao = avalia(clf, [X_treino, X_valida, X_teste], [y_train, y_valid, y_test], exibir=False)
    
    resultado_acc[f'{n}_comp'] = avaliacao

CPU times: total: 6.92 s
Wall time: 7.19 s


In [104]:
score = pd.DataFrame(resultado_acc).T

## Conclusion

- What happened to the accuracy?
- What happened to the processing time?

In [105]:
score

| | treino | validação | teste |
|---|---|---|---|
| 1_comp | 0.501632 | 0.490751 | 0.454021 |
| 2_comp | 0.617338 | 0.595212 | 0.578554 |
| 5_comp | 0.850925 | 0.831338 | 0.783509 |
| 10_comp | 0.895357 | 0.858542 | 0.80828 |
| 25_comp | 0.912768 | 0.869423 | 0.799796 |
| 50_comp | 0.931266 | 0.861806 | 0.803529 |
| 100_comp | 0.937976 | 0.87704 | 0.796742 |
| 150_comp | 0.941603 | 0.873776 | 0.799457 |
| 200_comp | 0.941059 | 0.872144 | 0.791992 |


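Accuracy rises quickly with the first few components and then plateaus. Test accuracy goes from 45.4% with 1 component to 80.8% with 10, then stays between 79.2% and 80.4% up to 200 components; validation accuracy stays between 86.2% and 87.7% from 25 components onward. Training accuracy, meanwhile, keeps climbing from 89.5% with 10 components to about 94% with 150 or 200. This widening gap, with no gain on unseen data, is consistent with the Hughes phenomenon: for a fixed training sample, adding dimensions beyond a certain point stops helping and may slightly hurt. Even the best PCA tree (80.8% on test) remains below the tree trained on all original variables (86.6%).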
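To choose the number of components without peeking at the test set, one option is to pick the row with the highest validation accuracy and only then read off the test accuracy. The sketch below rebuilds the validation and test columns of the table above by hand; the name `score_demo` is an illustrative assumption.

```python
import pandas as pd

# Validation and test accuracies copied from the results table above
score_demo = pd.DataFrame(
    {
        "validação": [0.490751, 0.595212, 0.831338, 0.858542, 0.869423,
                      0.861806, 0.877040, 0.873776, 0.872144],
        "teste": [0.454021, 0.578554, 0.783509, 0.808280, 0.799796,
                  0.803529, 0.796742, 0.799457, 0.791992],
    },
    index=[f"{n}_comp" for n in [1, 2, 5, 10, 25, 50, 100, 150, 200]],
)

# Select on validation accuracy, then report the test accuracy of that choice
best = score_demo["validação"].idxmax()
test_at_best = score_demo.loc[best, "teste"]
print(best, test_at_best)  # 100_comp 0.796742
```

The best test accuracy is at `10_comp`, but choosing on that basis would use the test set for model selection.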

As for time, fitting the tree on all 561 original variables took about 4 s of CPU time (5.58 s wall time), while the tree on a single component took 62.5 ms. Fitting and evaluating all nine trees (with 1, 2, 5, 10, 25, 50, 100, 150 and 200 components) took about 7 s in total. These figures do not include the PCA step itself, which took about 5 s of wall time for the nine fits. Even so, the results show how much PCA can reduce the cost of training the tree.