### 1. Install jupyter-notebook, pandas, matplotlib and scikit-learn
1. Install required packages

Execute: `pip install jupyter pandas matplotlib scikit-learn`

2. Run jupyter notebook

Execute: `jupyter notebook`

### Get pandas version

In [62]:
import pandas as pd
pd.__version__

'0.23.4'

### Opening csv file for exploration

For this example, we're working with European unemployment data from Eurostat.

Create a `DataFrame` from the CSV file as follows:

In [63]:
df = pd.read_csv('data/country_total.csv')
# Show the first 5 rows
df.head()
#df.tail()

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate
0,at,nsa,1993.01,171000,4.5
1,at,nsa,1993.02,175000,4.6
2,at,nsa,1993.03,166000,4.4
3,at,nsa,1993.04,157000,4.1
4,at,nsa,1993.05,147000,3.9


In [64]:
# What is the size of the dataset (rows, columns)?
df.shape

(20796, 5)

Calling `.describe()` generates useful summary statistics. By default, it summarizes only the numeric columns.

In [65]:
df.describe()

Unnamed: 0,month,unemployment,unemployment_rate
count,20796.0,20796.0,19851.0
mean,1999.40129,790081.8,8.179764
std,7.483751,1015280.0,3.922533
min,1983.01,2000.0,1.1
25%,1994.09,140000.0,5.2
50%,2001.01,310000.0,7.6
75%,2006.01,1262250.0,10.0
max,2010.12,4773000.0,20.9
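
Since the default summary skips text columns such as `country` and `seasonality`, it can help to ask for them explicitly. The following is a minimal sketch on a small stand-in frame (the values in `sample` are made up for illustration and are not the Eurostat data); `include='all'` is a standard option of `DataFrame.describe`.

```python
import pandas as pd

# Hypothetical stand-in for the unemployment data (illustrative values only)
sample = pd.DataFrame({
    'country': ['at', 'at', 'be', 'be'],
    'seasonality': ['nsa', 'sa', 'nsa', 'sa'],
    'unemployment_rate': [4.5, 4.6, 8.1, None],  # None becomes NaN
})

# Default behaviour: only numeric columns are summarized
numeric_summary = sample.describe()

# include='all' adds count/unique/top/freq rows for the text columns
full_summary = sample.describe(include='all')

print(numeric_summary)
print(full_summary)
```

Note that `count` excludes missing values, which is why `unemployment_rate` can report a smaller count than the number of rows.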


### Open a different file (countries)

In [66]:
df_countries = pd.read_csv('data/countries.csv')
df_countries.head()

Unnamed: 0,country,google_country_code,country_group,name_en,name_fr,name_de,latitude,longitude
0,at,AT,eu,Austria,Autriche,Österreich,47.696554,13.34598
1,be,BE,eu,Belgium,Belgique,Belgien,50.501045,4.476674
2,bg,BG,eu,Bulgaria,Bulgarie,Bulgarien,42.725674,25.482322
3,hr,HR,non-eu,Croatia,Croatie,Kroatien,44.746643,15.340844
4,cy,CY,eu,Cyprus,Chypre,Zypern,35.129141,33.428682


In [67]:
df_countries.shape

(30, 8)

In [68]:
df_countries.describe()

Unnamed: 0,latitude,longitude
count,30.0,30.0
mean,49.092609,14.324579
std,7.956624,11.25701
min,35.129141,-8.239122
25%,43.230916,6.979186
50%,49.238087,14.941462
75%,54.0904,23.35169
max,64.950159,35.439795


In [69]:
#df_countries['country']
df_countries['country'].values

array(['at', 'be', 'bg', 'hr', 'cy', 'cz', 'dk', 'ee', 'fi', 'fr', 'de',
       'gr', 'hu', 'ie', 'it', 'lv', 'lt', 'lu', 'mt', 'nl', 'no', 'pl',
       'pt', 'ro', 'sk', 'si', 'es', 'se', 'tr', 'uk'], dtype=object)

In [70]:
df_countries['country'].describe()

count     30
unique    30
top       pt
freq       1
Name: country, dtype: object

### Can you answer the following questions about `df_countries`?
- What columns does it contain?
- What does each row stand for?
- How many rows and columns does it contain?
- Are there any missing values in the latitude or longitude columns?
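
One possible way to check these points in code is sketched below. It uses a tiny stand-in frame called `countries_sample` (a hypothetical name, with one latitude deliberately left missing) so that the snippet runs without the CSV file; the same calls can be applied to `df_countries`.

```python
import pandas as pd

# Hypothetical stand-in for df_countries; one latitude is deliberately missing
countries_sample = pd.DataFrame({
    'country': ['at', 'be', 'bg'],
    'name_en': ['Austria', 'Belgium', 'Bulgaria'],
    'latitude': [47.696554, 50.501045, None],
    'longitude': [13.34598, 4.476674, 25.482322],
})

# Which columns does it contain?
column_names = list(countries_sample.columns)

# How many rows and columns?
n_rows, n_cols = countries_sample.shape

# How many missing values per coordinate column?
missing_per_column = countries_sample[['latitude', 'longitude']].isna().sum()

print(column_names)
print(n_rows, n_cols)
print(missing_per_column)
```

For the real data, the `count` row of `df_countries.describe()` is another hint: a count equal to the number of rows means no values are missing in that column.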


In [71]:
df.head()

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate
0,at,nsa,1993.01,171000,4.5
1,at,nsa,1993.02,175000,4.6
2,at,nsa,1993.03,166000,4.4
3,at,nsa,1993.04,157000,4.1
4,at,nsa,1993.05,147000,3.9


In [72]:
df.rename(columns={'month':'year_month'}, inplace=True)

In [73]:
df.head()

Unnamed: 0,country,seasonality,year_month,unemployment,unemployment_rate
0,at,nsa,1993.01,171000,4.5
1,at,nsa,1993.02,175000,4.6
2,at,nsa,1993.03,166000,4.4
3,at,nsa,1993.04,157000,4.1
4,at,nsa,1993.05,147000,3.9


In [74]:
df[:5]

Unnamed: 0,country,seasonality,year_month,unemployment,unemployment_rate
0,at,nsa,1993.01,171000,4.5
1,at,nsa,1993.02,175000,4.6
2,at,nsa,1993.03,166000,4.4
3,at,nsa,1993.04,157000,4.1
4,at,nsa,1993.05,147000,3.9


### Custom DataFrame

We create a `bacteria` DataFrame with a custom (label-based) index:

In [75]:
bacteria = pd.DataFrame({'bacteria_counts' : [632, 1638, 569, 115],
                         'other_feature' : [438, 833, 234, 298]},
                         index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

In [76]:
bacteria

Unnamed: 0,bacteria_counts,other_feature
Firmicutes,632,438
Proteobacteria,1638,833
Actinobacteria,569,234
Bacteroidetes,115,298


In [77]:
bacteria.loc['Firmicutes']

bacteria_counts    632
other_feature      438
Name: Firmicutes, dtype: int64

In [78]:
bacteria.iloc[0]

bacteria_counts    632
other_feature      438
Name: Firmicutes, dtype: int64

In [79]:
bacteria.iloc[0:2]

Unnamed: 0,bacteria_counts,other_feature
Firmicutes,632,438
Proteobacteria,1638,833
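
The cells above use `.loc` (label-based) and `.iloc` (position-based) on the same frame. One difference worth seeing side by side is how slices behave: position slices exclude the end, while label slices include it. The sketch below rebuilds the `bacteria` frame so that it runs on its own.

```python
import pandas as pd

bacteria = pd.DataFrame(
    {'bacteria_counts': [632, 1638, 569, 115],
     'other_feature': [438, 833, 234, 298]},
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

# Position-based slice: end is exclusive, so this returns rows 0 and 1
by_position = bacteria.iloc[0:2]

# Label-based slice: end is inclusive, so this returns three rows
by_label = bacteria.loc['Firmicutes':'Actinobacteria']

# Row label and column name together select a single cell
single_value = bacteria.loc['Proteobacteria', 'bacteria_counts']

print(by_position)
print(by_label)
print(single_value)
```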


### Change data type for a column

In [80]:
df[:2]

Unnamed: 0,country,seasonality,year_month,unemployment,unemployment_rate
0,at,nsa,1993.01,171000,4.5
1,at,nsa,1993.02,175000,4.6


In [81]:
df['year_month'].dtype

dtype('float64')

In [82]:
# Note: astype(int) truncates, so 1993.01 becomes 1993 and the month part is dropped
df['year_month'] = df['year_month'].astype(int)
df['year_month'].dtype

dtype('int64')
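
Casting with `astype(int)` keeps only the year. If the month is needed later, one option is to split the original float into two integer columns before converting. This is a sketch under the assumption that the values follow the `YYYY.MM` pattern seen in the output above; the frame `sample` and the new column names `year` and `month` are illustrative.

```python
import pandas as pd

# Hypothetical values in the same YYYY.MM float format as the dataset
sample = pd.DataFrame({'year_month': [1993.01, 1993.10, 2010.12]})

# Integer cast truncates toward zero, leaving the year
year = sample['year_month'].astype(int)

# The fractional part times 100 is the month; round() guards against
# floating-point error such as 0.9999999 instead of 1
month = ((sample['year_month'] - year) * 100).round().astype(int)

sample['year'] = year
sample['month'] = month
print(sample)
```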

### Creating a new column

In [83]:
df['ratex10'] = df['unemployment_rate'] * 10
df.head()

Unnamed: 0,country,seasonality,year_month,unemployment,unemployment_rate,ratex10
0,at,nsa,1993,171000,4.5,45.0
1,at,nsa,1993,175000,4.6,46.0
2,at,nsa,1993,166000,4.4,44.0
3,at,nsa,1993,157000,4.1,41.0
4,at,nsa,1993,147000,3.9,39.0


### Delete a column

In [84]:
df.drop(columns=['ratex10'], inplace=True)
df.head()

Unnamed: 0,country,seasonality,year_month,unemployment,unemployment_rate
0,at,nsa,1993,171000,4.5
1,at,nsa,1993,175000,4.6
2,at,nsa,1993,166000,4.4
3,at,nsa,1993,157000,4.1
4,at,nsa,1993,147000,3.9


### Selecting and filtering values

In [87]:
selection = df['year_month']==1993

In [89]:
len(df[selection])

528
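
The comparison above produces a boolean Series (a mask), and indexing the DataFrame with it keeps only the rows where the mask is `True`. Masks can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the unemployment data (illustrative values only)
sample = pd.DataFrame({
    'country': ['at', 'at', 'be', 'be'],
    'year_month': [1993, 1994, 1993, 1994],
    'unemployment_rate': [4.5, 4.6, 8.1, 9.6],
})

# Boolean mask: True where the year is 1993
is_1993 = sample['year_month'] == 1993
rows_1993 = sample[is_1993]

# Combine conditions with &; parentheses are required around each comparison
high_rate_1993 = sample[is_1993 & (sample['unemployment_rate'] > 5)]

# Summing a mask counts the True values, equivalent to len(sample[is_1993])
count_1993 = int(is_1993.sum())

print(rows_1993)
print(high_rate_1993)
print(count_1993)
```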

### Merging DataFrames