# Lesson 07 - The data analysis process

## 07.01 - Overview of the data analysis process

### Step 1: Ask questions (Question)
Either you are given data and ask questions based on it, or you ask the questions first and then gather data based on them. In both cases, great questions help you focus on the relevant parts of your data and direct your analysis toward meaningful insights.

### Step 2: Wrangle the data (Wrangle)
You get the data you need into a form you can work with, in three steps:
- gather;
- assess; and
- clean.

You gather the data you need to answer your questions, assess it to identify any problems in its quality or structure, and clean it by modifying, replacing, or removing data so that your dataset is of the highest quality and as well structured as possible.
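
The three sub-steps can be sketched roughly as follows. This is only an illustration, assuming pandas is installed; the column names `name` and `score` and all values are made up and do not come from the course data.

```python
import io

import pandas as pd

# Gather: load the data (here from an in-memory CSV string with made-up values).
raw_csv = io.StringIO(
    "name,score\n"
    "Ana,8.5\n"
    "Bruno,\n"
    "Ana,8.5\n"
    "Carla,7.0\n"
)
df = pd.read_csv(raw_csv)

# Assess: look for quality problems such as missing values and duplicate rows.
n_missing = int(df["score"].isnull().sum())
n_duplicates = int(df.duplicated().sum())

# Clean: drop the duplicate rows, then fill the missing score with the column mean.
clean = df.drop_duplicates().copy()
clean["score"] = clean["score"].fillna(clean["score"].mean())
```

Filling with the mean is just one possible choice; depending on the question, dropping the incomplete row may be more appropriate.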
 
### Step 3: Perform exploratory data analysis (EDA) (Explore)
You explore and then augment your data to maximize the potential of your analyses, visualizations, and models. Exploring involves finding patterns in your data, visualizing relationships in it, and building intuition about the subject you are working on. After exploring, you can do things like remove outliers and create better features from your data, which is known as feature engineering.
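
A minimal sketch of those two follow-up actions, assuming pandas and a made-up dataset (the columns `height_m` and `weight_kg` are hypothetical). The 1.5 * IQR rule used to flag outliers is one common convention, not something prescribed by the course.

```python
import pandas as pd

# Made-up data: one value (500.0) is far from the rest.
df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.80, 1.68, 1.72],
    "weight_kg": [55.0, 70.0, 82.0, 500.0, 66.0],
})

# Flag outliers with the 1.5 * IQR rule and keep only the rows inside the range.
q1 = df["weight_kg"].quantile(0.25)
q3 = df["weight_kg"].quantile(0.75)
iqr = q3 - q1
in_range = df["weight_kg"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
explored = df[in_range].copy()

# Feature engineering: derive a new column (body mass index) from existing ones.
explored["bmi"] = explored["weight_kg"] / explored["height_m"] ** 2
```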

### Step 4: Draw conclusions (or even make predictions) (Draw conclusions)
This step is typically approached with machine learning or inferential statistics, which are beyond the scope of this course; the course focuses on drawing conclusions with descriptive statistics.


### Step 5: Communicate the results (Communicate)
You often need to justify and convey the meaning of the insights you found. Or, if your end goal is to build a system, you usually need to share what you built, explain how you reached your design decisions, and report how well it performs. There are many ways to communicate your results: reports, slides, blog posts, emails, presentations, or even conversations. Data visualization will always be very valuable.

## 07.02 - Overview of packages

### NumPy
From the official NumPy documentation (http://www.numpy.org/):

> NumPy is the fundamental package for scientific computing with Python. It contains among other things:
>
> - a powerful N-dimensional array object
> - sophisticated (broadcasting) functions
> - tools for integrating C/C++ and Fortran code
> - useful linear algebra, Fourier transform, and random number capabilities
> 
> Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
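
A few of the capabilities listed in the quote can be sketched as follows, assuming NumPy is installed. The array values are arbitrary and only serve as an illustration.

```python
import numpy as np

# A 2-D array (2 rows, 3 columns).
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Broadcasting: the 1-D array is stretched across both rows of `a`.
offsets = np.array([10.0, 20.0, 30.0])
shifted = a + offsets

# Linear algebra: matrix product of `a` (2x3) with its transpose (3x2).
gram = a @ a.T

# Random numbers from a seeded generator, so repeated runs give the same draws.
rng = np.random.default_rng(seed=0)
samples = rng.random(4)
```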

### Pandas
From the official pandas documentation (https://pandas.pydata.org/):

> Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

### Matplotlib
From the official Matplotlib documentation (https://matplotlib.org/):

> Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
>
> Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc., with just a few lines of code. For examples, see the sample plots and thumbnail gallery.
>
> For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc, via an object oriented interface or via a set of functions familiar to MATLAB users.

### scikit-learn
From the official scikit-learn documentation (http://scikit-learn.org/stable/):

> 
> - Simple and efficient tools for data mining and data analysis
> - Accessible to everybody, and reusable in various contexts
> - Built on NumPy, SciPy, and matplotlib
> - Open source, commercially usable - BSD license


### IPython
From the official IPython documentation (https://ipython.org/):

> IPython provides a rich architecture for interactive computing with:
> - A powerful interactive shell.
> - A kernel for Jupyter.
> - Support for interactive data visualization and use of GUI toolkits.
> - Flexible, embeddable interpreters to load into your own projects.
> - Easy to use, high performance tools for parallel computing.


### SciPy
SciPy is an ecosystem of packages for engineering, science, and mathematics. It includes NumPy, Matplotlib, IPython, SymPy, and pandas. (https://www.scipy.org/)


## 07.03 - Reading and writing CSV



In [12]:
import pandas as pd

df = pd.read_csv("bases/alunos_notas.csv")
df.head()

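| | registro | Nome | Frq | HW | Teste1 | Projeto1 | Test2 | Projeto2 | Final |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 27604 | José | 0.96 | 0.97 | 87.0 | 98.0 | 92.0 | 93.0 | 95.0 |
| 1 | 30572 | Alexandra | 1.0 | 0.84 | 92.0 | 89.0 | 94.0 | 92.0 | 91.0 |
| 2 | 39203 | Averistéia | 0.84 | 0.74 | 68.0 | 70.0 | 84.0 | 90.0 | 82.0 |
| 3 | 28592 | Kristiano | 0.96 | 1.0 | 82.0 | 94.0 | 90.0 | 81.0 | 84.0 |
| 4 | 27492 | Ricado | 0.32 | 0.85 | 98.0 | 100.0 | 73.0 | 82.0 | 88.0 |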


### Test with a dataset from http://archive.ics.uci.edu/ml/datasets/Stock+portfolio+performance
I modified the original file: I kept only the sheet with all periods, deleted the non-normalized output data, saved it as CSV, and replaced the comma used as the decimal separator with a period. The resulting file is saved in the `bases` subdirectory as "stock_portfolio_dataset_modified.csv".

#### Step 1: accessing the data

In [13]:
dados_portfolio = pd.read_csv("bases/stock_portfolio_dataset_modified.csv", header = 1)
dados_portfolio.head()

| | Large B/P | Large ROE | Large S/P | Large Return Rate in the last quarter | Large Market Value | Small systematic Risk | Annual Return | Excess Return | Systematic Risk | Total Risk | Abs. Win Rate | Rel. Win Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.532 | 0.478 | 0.738 | 0.8 | 0.52 | 0.412 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.55 | 0.488 | 0.572 | 0.412 | 0.52 | 0.765 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.693 | 0.63 | 0.703 | 0.757 | 0.44 | 0.376 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.324 | 0.256 | 0.8 | 0.756 | 0.36 | 0.271 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.327 | 0.307 | 0.432 | 0.209 | 0.72 | 0.447 |
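
The `header` argument of `read_csv` is the zero-based row number to use for the column labels; rows before it are skipped. The sketch below illustrates this with made-up in-memory contents. The title line and the values are hypothetical and only imitate a file whose first line is not the header, which is presumably why `header = 1` is used above.

```python
import io

import pandas as pd

# Hypothetical file contents: line 0 is a title, line 1 holds the column labels.
text = (
    "Some title line,,\n"
    "a,b,c\n"
    "1,2,3\n"
    "4,5,6\n"
)

# header=1 tells pandas to take the labels from the second line (index 1).
df = pd.read_csv(io.StringIO(text), header=1)
```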


#### Step 2: writing to another CSV

In [15]:
dados_portfolio.to_csv("bases/stock_portfolio_dataset_NOVO.csv")
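
By default, `to_csv` also writes the row index as an unnamed first column, which `read_csv` then loads back as a column called `Unnamed: 0`. The sketch below shows the round trip with made-up values, using an in-memory buffer instead of a file on disk. Passing `index=False` is one way to avoid the extra column.

```python
import io

import pandas as pd

df = pd.DataFrame({"x": [0.1, 0.2], "y": [1.0, 2.0]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so the round trip keeps the same columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
```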

## 07.04 - Assessing and building intuition
Using pandas for a preliminary assessment of the dataset.

In [19]:
dados_portfolio = pd.read_csv("bases/stock_portfolio_dataset_modified.csv", header = 1)

# returns a tuple with the dimensions of the dataframe: (rows, columns)
dados_portfolio.shape

(63, 12)

In [21]:
# returns the data type of each column
dados_portfolio.dtypes

 Large B/P                                 float64
 Large ROE                                 float64
 Large S/P                                 float64
 Large Return Rate in the last quarter     float64
 Large Market Value                        float64
 Small systematic Risk                     float64
Annual Return                              float64
Excess Return                              float64
Systematic Risk                            float64
Total Risk                                 float64
Abs. Win Rate                              float64
Rel. Win Rate                              float64
dtype: object

In [22]:
# displays a concise summary of the dataframe,
# including the number of non-null values in each column
dados_portfolio.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 12 columns):
 Large B/P                                 63 non-null float64
 Large ROE                                 63 non-null float64
 Large S/P                                 63 non-null float64
 Large Return Rate in the last quarter     63 non-null float64
 Large Market Value                        63 non-null float64
 Small systematic Risk                     63 non-null float64
Annual Return                              63 non-null float64
Excess Return                              63 non-null float64
Systematic Risk                            63 non-null float64
Total Risk                                 63 non-null float64
Abs. Win Rate                              63 non-null float64
Rel. Win Rate                              63 non-null float64
dtypes: float64(12)
memory usage: 6.0 KB


In [25]:
# returns the number of unique values in each column
dados_portfolio.nunique()

 Large B/P                                  7
 Large ROE                                  7
 Large S/P                                  7
 Large Return Rate in the last quarter      7
 Large Market Value                         7
 Small systematic Risk                      7
Annual Return                              57
Excess Return                              60
Systematic Risk                            58
Total Risk                                 59
Abs. Win Rate                              14
Rel. Win Rate                              18
dtype: int64

In [26]:
# returns useful descriptive statistics for each column
dados_portfolio.describe()

| | Large B/P | Large ROE | Large S/P | Large Return Rate in the last quarter | Large Market Value | Small systematic Risk | Annual Return | Excess Return | Systematic Risk | Total Risk | Abs. Win Rate | Rel. Win Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 | 63.0 |
| mean | 0.166619 | 0.166619 | 0.166619 | 0.166619 | 0.166619 | 0.166619 | 0.580143 | 0.576127 | 0.426476 | 0.391635 | 0.566984 | 0.547889 |
| std | 0.199304 | 0.199304 | 0.199304 | 0.199304 | 0.199304 | 0.199304 | 0.133384 | 0.13701 | 0.118235 | 0.136714 | 0.112803 | 0.159478 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5255 | 0.519 | 0.3585 | 0.297 | 0.52 | 0.412 |
| 50% | 0.167 | 0.167 | 0.167 | 0.167 | 0.167 | 0.167 | 0.599 | 0.587 | 0.403 | 0.369 | 0.56 | 0.553 |
| 75% | 0.2915 | 0.2915 | 0.2915 | 0.2915 | 0.2915 | 0.2915 | 0.6795 | 0.669 | 0.4705 | 0.4575 | 0.64 | 0.694 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 |


In [28]:
# returns the first rows of the dataframe
# by default, it returns the first five rows,
# but you can specify how many rows you want returned
dados_portfolio.head(20)

| | Large B/P | Large ROE | Large S/P | Large Return Rate in the last quarter | Large Market Value | Small systematic Risk | Annual Return | Excess Return | Systematic Risk | Total Risk | Abs. Win Rate | Rel. Win Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.532 | 0.478 | 0.738 | 0.8 | 0.52 | 0.412 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.55 | 0.488 | 0.572 | 0.412 | 0.52 | 0.765 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.693 | 0.63 | 0.703 | 0.757 | 0.44 | 0.376 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.324 | 0.256 | 0.8 | 0.756 | 0.36 | 0.271 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.327 | 0.307 | 0.432 | 0.209 | 0.72 | 0.447 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.2 | 0.2 | 0.491 | 0.429 | 0.2 | 0.235 |
| 6 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.8 | 0.8 | 0.36 | 0.375 | 0.76 | 0.729 |
| 7 | 0.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.652 | 0.594 | 0.724 | 0.797 | 0.48 | 0.376 |
| 8 | 0.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.751 | 0.721 | 0.454 | 0.418 | 0.56 | 0.765 |
| 9 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.585 | 0.587 | 0.439 | 0.46 | 0.64 | 0.376 |


In [29]:
# the same applies to `.tail()`, which returns the last rows of the dataframe
dados_portfolio.tail(2)

| | Large B/P | Large ROE | Large S/P | Large Return Rate in the last quarter | Large Market Value | Small systematic Risk | Annual Return | Excess Return | Systematic Risk | Total Risk | Abs. Win Rate | Rel. Win Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 0.0 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.636 | 0.634 | 0.362 | 0.277 | 0.6 | 0.8 |
| 62 | 0.167 | 0.167 | 0.167 | 0.167 | 0.167 | 0.167 | 0.693 | 0.716 | 0.296 | 0.285 | 0.68 | 0.8 |


In [30]:
# display the position and label of each column
for i, v in enumerate(dados_portfolio.columns):
    print(i, v)

0  Large B/P 
1  Large ROE 
2  Large S/P 
3  Large Return Rate in the last quarter 
4  Large Market Value 
5  Small systematic Risk
6 Annual Return
7 Excess Return
8 Systematic Risk
9 Total Risk
10 Abs. Win Rate
11 Rel. Win Rate
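
The printed labels suggest that some column names carry leading or trailing spaces. The sketch below uses a made-up dataframe whose labels imitate that pattern (the values are illustrative only). It strips the whitespace and then uses the positions printed by `enumerate` to select columns with `iloc`.

```python
import pandas as pd

# Made-up dataframe; the labels deliberately include stray spaces.
df = pd.DataFrame(
    [[1.0, 0.0, 0.5], [0.0, 1.0, 0.6]],
    columns=[" Large B/P ", " Large ROE ", "Annual Return"],
)

# Remove leading and trailing whitespace from every column label.
df.columns = df.columns.str.strip()

# Select columns by position: every row, columns 0 and 2.
subset = df.iloc[:, [0, 2]]
```

Selecting by position avoids typing long labels, at the cost of breaking silently if the column order changes.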
