# First steps (and a little statistics) in pandas

In pandas' own words:

"pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language."

## 1. Importing pandas

In [1]:
import pandas as pd

## 2. Creating and visualizing tabular data

In **pandas** we work with tabular data, more specifically with a **DataFrame**. A **DataFrame** is composed of _rows_ and _columns_, each identified by its _label_. You can think of each column as a `Series`, and of the DataFrame as a collection of Series that share the same index.

Formally speaking, a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure.
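
As a minimal sketch of the column/Series relationship described above (the small table and its column names are made up for illustration only):

```python
import pandas as pd

# Hypothetical two-column table, used only for illustration
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1.0, 2.0, 3.0]})

col = df['value']           # selecting one column returns a Series
print(type(col).__name__)   # class name of the selected column
print(df.shape)             # (number of rows, number of columns)
print(df.dtypes)            # each column keeps its own dtype
```

The Series shares the row labels of the DataFrame it was taken from, which is what "heterogeneous" means in practice: one index, several columns, each with its own type.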

### 2.1 Using a Python dictionary

Writing the dictionary:

In [14]:
dogs = {
    'Breed': ['Akita', 'Beagle', 'Collie'],
    'Group': ['Working', 'Hound', 'Hound'],
    'Weight': [45, 11, 22],
    'Life span': [11, 14, 12]
}

Creating the data frame:

In [15]:
df_dogs = pd.DataFrame(dogs)

Visualizing the data frame:

In [16]:
df_dogs

|  | Breed | Group | Weight | Life span |
|---|---|---|---|---|
| 0 | Akita | Working | 45 | 11 |
| 1 | Beagle | Hound | 11 | 14 |
| 2 | Collie | Hound | 22 | 12 |


Notice that the first column contains only numbers. These are the DataFrame's **index** labels. To set your own index, pass the `index` argument:

In [52]:
df_dogs = pd.DataFrame(dogs, index = ['a', 'b', 'c'])

In [53]:
df_dogs

|  | Breed | Group | Weight | Life span |
|---|---|---|---|---|
| a | Akita | Working | 45 | 11 |
| b | Beagle | Hound | 11 | 14 |
| c | Collie | Hound | 22 | 12 |


Or you can assign a new list of labels to the `index` attribute:

In [54]:
df_dogs.index = [0, 1, 2]
df_dogs

|  | Breed | Group | Weight | Life span |
|---|---|---|---|---|
| 0 | Akita | Working | 45 | 11 |
| 1 | Beagle | Hound | 11 | 14 |
| 2 | Collie | Hound | 22 | 12 |
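
The index is more than decoration: rows can be looked up by label or by position. The sketch below reuses a trimmed copy of the `dogs` dictionary; `loc`, `iloc` and `reset_index` are standard pandas tools that the text above does not use, shown here only as one possible illustration.

```python
import pandas as pd

dogs = {
    'Breed': ['Akita', 'Beagle', 'Collie'],
    'Weight': [45, 11, 22],
}
df = pd.DataFrame(dogs, index=['a', 'b', 'c'])

by_label = df.loc['b']       # row whose index label is 'b'
by_position = df.iloc[1]     # second row, whatever its label is
print(by_label['Breed'], by_position['Breed'])

# reset_index() restores the default 0, 1, 2, ... index;
# drop=True discards the old labels instead of keeping them as a column
df_default = df.reset_index(drop=True)
print(list(df_default.index))
```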


### 2.2 Using lists

Writing the list:

In [19]:
cats = [['Abyssinian', 3.6, 12], ['Bengal', 6.4, 14], ['Manx', 5, 11]]

Creating the data frame:

In [20]:
df_cats = pd.DataFrame(cats, columns = ['Breed', 'Weight', 'Life span'])

Note that we passed the `columns` argument as a list giving the header of each data frame column.

In [21]:
df_cats

|  | Breed | Weight | Life span |
|---|---|---|---|
| 0 | Abyssinian | 3.6 | 12 |
| 1 | Bengal | 6.4 | 14 |
| 2 | Manx | 5.0 | 11 |


If you forget the headers:

In [50]:
df_cats = pd.DataFrame(cats)
df_cats

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Abyssinian | 3.6 | 12 |
| 1 | Bengal | 6.4 | 14 |
| 2 | Manx | 5.0 | 11 |


You can add them afterwards by assigning to the `columns` attribute:

In [51]:
headers = ['Breed', 'Weight', 'Life span']
df_cats.columns = headers
df_cats

|  | Breed | Weight | Life span |
|---|---|---|---|
| 0 | Abyssinian | 3.6 | 12 |
| 1 | Bengal | 6.4 | 14 |
| 2 | Manx | 5.0 | 11 |


### 2.3 Using numpy arrays

First, import **numpy**:

In [22]:
import numpy as np

Creating data:

In [106]:
x = np.linspace(0, 10, 100)
y = np.logspace(10, 1000, 100)
fx = 4.*x + 10
gy = 100*(y**2) - 500

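NumPy emits an overflow `RuntimeWarning` here: `np.logspace(10, 1000, 100)` asks for values up to $10^{1000}$, far beyond what a 64-bit float can hold, so the largest entries of `y` and `gy` become `inf`.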
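
To see where the infinities come from, the following sketch recomputes `y` and `gy` and counts the overflowed entries. The use of `np.errstate` to silence the warning and of `np.finfo` to show the float limit is an addition for illustration, not part of the original cell.

```python
import numpy as np

# Largest finite float64, roughly 1.8e308
max_float = np.finfo(np.float64).max

# Silence the overflow warnings for this demonstration only
with np.errstate(over='ignore'):
    y = np.logspace(10, 1000, 100)   # exponents run from 10 to 1000
    gy = 100 * (y**2) - 500

print(max_float)
print(np.isinf(y).sum())    # entries of y that overflowed
print(np.isinf(gy).sum())   # squaring overflows earlier, so more entries
```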


Generating the data frame:

In [107]:
data = np.array([x, y, fx, gy]).T
columns = ['x', 'y', 'f (x)', 'g (y)']
df_arrays = pd.DataFrame(data, columns = columns)

Visualizing just a part of the data frame using the method `head`:

In [110]:
df_arrays.head()

|  | x | y | f (x) | g (y) |
|---|---|---|---|---|
| 0 | 0.0 | 10000000000.0 | 10.0 | 1e+22 |
| 1 | 0.10101 | 1e+20 | 10.40404 | 1e+42 |
| 2 | 0.20202 | 1e+30 | 10.808081 | 1e+62 |
| 3 | 0.30303 | 9.999999999999999e+39 | 11.212121 | 1e+82 |
| 4 | 0.40404 | 1e+50 | 11.616162 | 1.0000000000000001e+102 |


If you want to see just a specific number of rows, give it as the argument of `head`:

In [111]:
df_arrays.head(3)

|  | x | y | f (x) | g (y) |
|---|---|---|---|---|
| 0 | 0.0 | 10000000000.0 | 10.0 | 1e+22 |
| 1 | 0.10101 | 1e+20 | 10.40404 | 1e+42 |
| 2 | 0.20202 | 1e+30 | 10.808081 | 1e+62 |


Likewise, if you are interested in the bottom of the data frame, use the method `tail`:

In [112]:
df_arrays.tail()

|  | x | y | f (x) | g (y) |
|---|---|---|---|---|
| 95 | 9.59596 | inf | 48.383838 | inf |
| 96 | 9.69697 | inf | 48.787879 | inf |
| 97 | 9.79798 | inf | 49.191919 | inf |
| 98 | 9.89899 | inf | 49.59596 | inf |
| 99 | 10.0 | inf | 50.0 | inf |


Note that there are $\infty$ values in this data set. We will handle them in a minute.

## 3. Saving a data frame to a csv file

Using the method `to_csv` you can save your data as a readable `csv` file:

In [104]:
df_arrays.to_csv('df_arrays.csv')
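
By default `to_csv` also writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below illustrates this on a made-up two-row frame, writing to a string instead of a file so that nothing is left on disk; `index=False` and `index_col=0` are the two usual ways to deal with it.

```python
import io
import pandas as pd

df = pd.DataFrame({'x': [0.0, 0.5], 'y': [1.0, 2.0]})

# With no path, to_csv returns the CSV text; the index is written
# as an extra first column with an empty header
with_index = df.to_csv()
# index=False writes only the data columns
without_index = df.to_csv(index=False)

# Reading the first version back produces an 'Unnamed: 0' column
back = pd.read_csv(io.StringIO(with_index))
print(list(back.columns))

# index_col=0 tells read_csv to use that first column as the index again
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(restored.columns))
```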

## 4. Importing and visualizing data frames

### 4.1 From a link

Here we load the [Wine Quality Data Set](http://archive.ics.uci.edu/ml/datasets/Wine+Quality) directly from its URL:

In [66]:
path = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'

Creating the data frame using the function `read_csv`:

In [68]:
df_wine = pd.read_csv(path, sep=';')

In [69]:
df_wine.head()

|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


Note that this `csv` file is not _comma-separated_: it uses _semicolons_ instead, so the separator has to be specified before we can work with the data. That is why I passed `sep=';'` to `read_csv`.
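
To make the effect of `sep` concrete without downloading anything, here is a small sketch that parses a made-up, three-column excerpt in the same semicolon-separated layout, once with the default separator and once with `sep=';'`:

```python
import io
import pandas as pd

# Made-up rows in the same semicolon-separated layout (illustration only)
text = 'fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.2;5\n'

wrong = pd.read_csv(io.StringIO(text))             # default sep=','
right = pd.read_csv(io.StringIO(text), sep=';')

print(wrong.shape)           # a single column holding each whole line
print(right.shape)           # one column per field
print(list(right.columns))
```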

### 4.2 From `sklearn.datasets`

Python's [`scikit-learn`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) library has an excellent collection of data sets for practicing Data Science. However, they do not come in `csv` format, so we need to convert them, which is a simple task. To use them you first need to load the library (note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below need an older version):

In [87]:
from sklearn.datasets import load_boston

Loading the data, which comes in `Bunch` format:

In [88]:
boston = load_boston()
type(boston)

sklearn.utils.Bunch

A description of this dataset is available through the `DESCR` attribute:

In [94]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

The `feature_names` attribute gives us exactly the feature names:

In [92]:
features = boston.feature_names
features

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

The `data` attribute holds all the data as a `numpy` array:

In [96]:
boston_data = boston.data
boston_data.shape

(506, 13)

Then we can convert it to a data frame:

In [97]:
df_boston = pd.DataFrame(boston_data, columns = features)

And we get:

In [98]:
df_boston.head()

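|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |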
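
The data frame above holds only the 13 features; the median value that the description calls the target lives in a separate `target` attribute of the `Bunch`. The sketch below shows how it could be attached as an extra column. Because `load_boston` is not available in recent scikit-learn releases, it uses a hypothetical stand-in object with the same attribute names and made-up numbers, and the column name `MEDV` is an assumption chosen for this example.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

# Hypothetical stand-in for a Bunch: same attribute names, made-up values
bunch = SimpleNamespace(
    data=np.array([[0.1, 6.5], [0.2, 6.4], [0.3, 7.1]]),
    feature_names=np.array(['CRIM', 'RM']),
    target=np.array([24.0, 21.6, 34.7]),
)

df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df['MEDV'] = bunch.target   # 'MEDV' is a column name chosen for this sketch
print(df)
```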


In [None]:
# With no path argument, to_csv() returns the CSV text as a string instead of writing a file
df_boston.to_csv()

## 5. Selecting and Replacing Data

### 5.1 Missing values

In data science, `NaN` values are called **missing values**. Sometimes the problematic values are $\pm \infty$ instead; we can replace them with `NaN` as follows:

In [122]:
df_arrays = df_arrays.replace([np.inf, -np.inf], np.nan)

We can check whether there are `NaN` values using:

In [123]:
df_arrays.isnull()

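|  | x | y | f (x) | g (y) |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | False | False |
| 3 | False | False | False | False |
| 4 | False | False | False | False |
| ... | ... | ... | ... | ... |
| 95 | False | True | False | True |
| 96 | False | True | False | True |
| 97 | False | True | False | True |
| 98 | False | True | False | True |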
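
A full table of booleans is hard to read for larger data sets. One common way to summarize it, sketched here on a made-up three-row frame, is to sum the booleans column by column, since `True` counts as 1:

```python
import numpy as np
import pandas as pd

# Made-up data with infinities in one column (illustration only)
df = pd.DataFrame({'x': [0.0, 1.0, 2.0], 'y': [1.0, np.inf, -np.inf]})
df = df.replace([np.inf, -np.inf], np.nan)

missing_per_column = df.isnull().sum()    # number of NaN values per column
anything_missing = df.isnull().any().any()

print(missing_per_column)
print(anything_missing)
```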


Finally, we can replace the missing values using the method `fillna`, choosing, for example, $0$ as the replacement value.

In [126]:
df_arrays = df_arrays.fillna(0) 
df_arrays.tail()

|  | x | y | f (x) | g (y) |
|---|---|---|---|---|
| 95 | 9.59596 | 0.0 | 48.383838 | 0.0 |
| 96 | 9.69697 | 0.0 | 48.787879 | 0.0 |
| 97 | 9.79798 | 0.0 | 49.191919 | 0.0 |
| 98 | 9.89899 | 0.0 | 49.59596 | 0.0 |
| 99 | 10.0 | 0.0 | 50.0 | 0.0 |
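
The replacement value matters for any statistic computed afterwards. As a rough sketch on a made-up Series, compare the mean after filling with $0$ against the mean after simply dropping the missing entries with `dropna` (a method not used in the text above):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, np.nan, np.nan])

mean_skipping = s.mean()            # NaN is skipped by default: (2 + 4) / 2
mean_filled = s.fillna(0).mean()    # zeros count as data: (2 + 4 + 0 + 0) / 4
mean_dropped = s.dropna().mean()    # dropping missing rows keeps the mean

print(mean_skipping, mean_filled, mean_dropped)
```

Filling with $0$ is convenient, but it pulls the mean toward zero, so whether it is appropriate depends on what the missing values stand for.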


### 5.2 Data selection

We can select part of the data, discarding what we don't need in our analysis. For instance, we can keep just the rows with $x < 5$ in `df_arrays`:

In [127]:
df_arrays = df_arrays[df_arrays['x'] < 5]

Then, if we take a look:

In [128]:
df_arrays.tail()

|  | x | y | f (x) | g (y) |
|---|---|---|---|---|
| 45 | 4.545455 | 0.0 | 28.181818 | 0.0 |
| 46 | 4.646465 | 0.0 | 28.585859 | 0.0 |
| 47 | 4.747475 | 0.0 | 28.989899 | 0.0 |
| 48 | 4.848485 | 0.0 | 29.393939 | 0.0 |
| 49 | 4.949495 | 0.0 | 29.79798 | 0.0 |
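
The same boolean-mask idea extends to several conditions at once. The sketch below uses a small made-up frame (column names `x` and `fx` are chosen for the example) to combine two conditions with `&`, and to select rows and a column in one step with `loc`:

```python
import numpy as np
import pandas as pd

x = np.linspace(0, 10, 11)                   # 0, 1, ..., 10
df = pd.DataFrame({'x': x, 'fx': 4.0 * x + 10})

# Each condition needs its own parentheses; use & for "and", | for "or"
middle = df[(df['x'] > 2) & (df['x'] < 5)]
print(middle)

# .loc selects rows by condition and a column by label in one step
fx_only = df.loc[df['x'] >= 8, 'fx']
print(fx_only.tolist())
```

Note that filtering keeps the original index labels of the surviving rows, which is why the tail shown above ends at label 49 rather than being renumbered.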
