Pandas: introduction to data structures
https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro

### Simulating a dataframe

In [36]:

import pandas as pd

data = {"var1": [1, 2, 3, 4, 5, 6], "var2": [10, 20, 30, 40, 50, 60], "var3": ["a", "b", "c", "d", "e", "f"]}
adf = pd.DataFrame(data)


In [37]:
adf

Unnamed: 0,var1,var2,var3
0,1,10,a
1,2,20,b
2,3,30,c
3,4,40,d
4,5,50,e
5,6,60,f


In [38]:
adf.index

RangeIndex(start=0, stop=6, step=1)

### Indexing and slicing

In [39]:
adf.iloc[1:]

Unnamed: 0,var1,var2,var3
1,2,20,b
2,3,30,c
3,4,40,d
4,5,50,e
5,6,60,f


In [40]:
adf.iloc[:-1]

Unnamed: 0,var1,var2,var3
0,1,10,a
1,2,20,b
2,3,30,c
3,4,40,d
4,5,50,e


In [41]:
adf.loc[:, ["var1", "var3"]]

Unnamed: 0,var1,var3
0,1,a
1,2,b
2,3,c
3,4,d
4,5,e
5,6,f
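
The slices above use the default integer index, where positions and labels coincide. The following minimal sketch uses a hypothetical frame `df` with string labels (not part of the notebook) to show how `.iloc` (position-based, end excluded) and `.loc` (label-based, end included) differ.

```python
import pandas as pd

# Hypothetical frame with string labels, so positions and labels differ
df = pd.DataFrame(
    {"var1": [1, 2, 3], "var2": [10, 20, 30]},
    index=["x", "y", "z"],
)

# .iloc slices by position; the end position is excluded
by_position = df.iloc[0:2]

# .loc slices by label; the end label is included
by_label = df.loc["x":"y"]

print(by_position)
print(by_label)
```

Both selections return the rows labelled `x` and `y`, even though the position slice stops at 2 and the label slice stops at `"y"`.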


#### Exercise
 - Access the value 2
 - Filter the dataframe on a condition: select rows only if var2 > 30. Store the selected subset into another dataframe


### Vectorizing operations on a pandas dataframe

In [42]:
adf.var1 + adf.var2

0    11
1    22
2    33
3    44
4    55
5    66
dtype: int64

In [45]:
# Row-by-row equivalent of the vectorized sum above (much slower on large frames)
for index, row in adf.iterrows():
    adf.at[index, 'var4'] = row['var1'] + row['var2']

In [46]:
adf

Unnamed: 0,var1,var2,var3,var4
0,1,10,a,11.0
1,2,20,b,22.0
2,3,30,c,33.0
3,4,40,d,44.0
4,5,50,e,55.0
5,6,60,f,66.0


In [49]:
adf['var5'] = adf.apply(lambda row: row['var1'] + row['var4'], axis=1)


In [50]:
adf

Unnamed: 0,var1,var2,var3,var4,var5
0,1,10,a,11.0,12.0
1,2,20,b,22.0,24.0
2,3,30,c,33.0,36.0
3,4,40,d,44.0,48.0
4,5,50,e,55.0,60.0
5,6,60,f,66.0,72.0
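
For comparison, the same two columns can be built with plain column arithmetic instead of `iterrows` or `apply`. This is a minimal sketch on a fresh frame `df` (a name assumed here, so that `adf` is left untouched); it keeps the integer dtype, whereas the loop above produced floats.

```python
import pandas as pd

# Fresh copy of the numeric columns used in the notebook
df = pd.DataFrame({"var1": [1, 2, 3, 4, 5, 6],
                   "var2": [10, 20, 30, 40, 50, 60]})

# Column-wise (vectorized) arithmetic: no explicit Python loop
df["var4"] = df["var1"] + df["var2"]
df["var5"] = df["var1"] + df["var4"]

print(df)
print(df.dtypes)
```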


In [16]:
# Pitfall: a decimal comma makes this the tuple (adf * 2, 5), not adf * 2.5
madf = adf * 2,5

In [26]:
adf["var1"] + 2,5

(0    3
 1    4
 2    5
 Name: var1, dtype: int64,
 5)

In [18]:
type(madf)

tuple
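
The tuple appears because Python reads `x + 2,5` as the pair `(x + 2, 5)`. A short sketch on a small hypothetical Series `s` contrasts the comma with a decimal point:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="var1")

# Comma: Python builds a 2-tuple (s + 2, 5)
as_tuple = s + 2, 5

# Decimal point: a single Series with 2.5 added to every element
as_float = s + 2.5

print(type(as_tuple))
print(as_float)
```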

### Change a column type

In [66]:
print(adf.var1.dtype)
adf.var1.astype('float')

int64


0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
Name: var1, dtype: float64
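
Note that `astype` returns a new Series and leaves the original column unchanged. A minimal sketch, using a small hypothetical frame `df`, of how the conversion can be made to stick by assigning the result back:

```python
import pandas as pd

df = pd.DataFrame({"var1": [1, 2, 3]})

# astype returns a converted copy; df["var1"] itself is still integer
converted = df["var1"].astype("float")
dtype_before = df["var1"].dtype

# Assign the result back to change the column in the dataframe
df["var1"] = df["var1"].astype("float")
dtype_after = df["var1"].dtype

print(dtype_before, dtype_after)
```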

## Exercise:
- Turn the species column into a categorical variable, using the `.astype('category')` method on the species column
- Filter for Sepal Width > 2
- Create a subset containing only Sepal Width and Sepal Length
- Create a subset with Sepal Length and Species where Sepal Width > 3
- Compute the mean of all columns
- Get the median of the Sepal Length column
- Create a new column in the dataframe wi

Import the iris dataset from scikit-learn:

https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html



In [5]:
from sklearn import datasets
import pandas as pd

In [55]:
import numpy as np

iris = datasets.load_iris()
# Stack the feature columns and the target side by side into one dataframe
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])


In [62]:
iris.rename(columns={"target": "species"}, inplace=True)
iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2.0
146,6.3,2.5,5.0,1.9,2.0
147,6.5,3.0,5.2,2.0,2.0
148,6.2,3.4,5.4,2.3,2.0


In [63]:
iris['species'] = iris.species.astype('category')


In [64]:
iris.species.dtype

CategoricalDtype(categories=[0.0, 1.0, 2.0], ordered=False)
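
The categories are still the numeric codes 0.0, 1.0 and 2.0. As an optional follow-up, they can be relabelled with the species names. The sketch below works on a small stand-in Series rather than the full dataset, and assumes the usual iris ordering (0 = setosa, 1 = versicolor, 2 = virginica), as given by `target_names` in `load_iris()`.

```python
import pandas as pd

# Small stand-in for the iris species column (numeric codes as floats)
species = pd.Series([0.0, 0.0, 1.0, 2.0]).astype("category")

# Assumed mapping from numeric code to species name
names = {0.0: "setosa", 1.0: "versicolor", 2.0: "virginica"}

# rename_categories relabels the categories without changing the data
species_named = species.cat.rename_categories(names)

print(species_named)
```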