In [10]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


![](https://imgs.xkcd.com/comics/norm_normal_file_format.png)

# Preprocessing

* CLI
  * Programs and pipes
* Pandas
  * Working with DataFrames
  * Handling missing values
  * Visualisations

## CLI

Shell commands can be run directly in your notebook by prefixing them with `!`.
  * Windows: make sure you are running your notebook through Git Bash

In [None]:
!ls

In [None]:
!wget https://data.kk.dk/dataset/76ecf368-bf2d-46a2-bcf8-adaf37662528/resource/9286af17-f74e-46c9-a428-9fb707542189/download/befkbhalderstatkode.csv

In [None]:
!ls

In [None]:
!head befkbhalderstatkode.csv

## Pandas

Pandas is a library for working with labeled, tabular data.

* Based on https://jakevdp.github.io/PythonDataScienceHandbook/



## Pandas vs Numpy
1. Pandas has the 1D `Series` and the 2D `DataFrame`; NumPy has the multi-dimensional `ndarray`.
2. A `DataFrame` has column names (like in SQL); an `ndarray` is sliced by integer indices.
3. A `DataFrame` can hold a different data type in each column, whereas an `ndarray` has a single dtype for all its elements.
![](pandas_vs_numpy.png)
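
The three differences above can be seen in a small sketch. The column names and values below are invented for illustration and do not come from any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# A 2D ndarray: one dtype for every element, addressed by integer position.
arr = np.array([[2015, 1.5], [2016, 2.5]])
print(arr.dtype)   # the integers were upcast so that all elements share one dtype
print(arr[0, 1])   # row 0, column 1

# A DataFrame: named columns, each with its own dtype.
# Column names and values are made up for this example.
frame = pd.DataFrame({
    "year": [2015, 2016],
    "city": ["Copenhagen", "Aarhus"],
    "value": [1.5, 2.5],
})
print(frame.dtypes)            # one dtype per column
print(frame.loc[0, "value"])   # row label 0, column name "value"
```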

In [None]:
import pandas as pd

## Importing data into Pandas


In [None]:
pd.read_csv('befkbhalderstatkode.csv')

In [None]:
df = pd.read_csv('befkbhalderstatkode.csv')

In [None]:
df = pd.read_csv('befkbhalderstatkode.csv')
df.head()

* What are the columns in the dataset?

In [None]:
df.columns

In [None]:
type(df)

* What is the dimensionality of the dataframe?

In [None]:
df.shape

In [None]:
type(df['ALDER'])

In [None]:
type(df.iloc[:,0])
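
Both `type(...)` calls above return a `Series`, because selecting one column gives back a one-dimensional object. As a minimal sketch on a stand-in frame (only the column name `ALDER` is taken from the notebook; the second column `VALUE` and all values are invented), single and double brackets behave differently:

```python
import pandas as pd

# Stand-in frame: "VALUE" and the numbers are hypothetical.
demo = pd.DataFrame({"ALDER": [0, 1, 2], "VALUE": [10, 20, 30]})

by_name = demo["ALDER"]        # single brackets: a Series
by_position = demo.iloc[:, 0]  # first column by position: also a Series
as_frame = demo[["ALDER"]]     # double brackets: a one-column DataFrame

print(type(by_name))
print(type(as_frame))
print(as_frame.shape)
```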

## `Series`

A `Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

http://pandas.pydata.org/pandas-docs/stable/dsintro.html#series

You can create a Series by passing a list of values, letting Pandas create a default integer index.

In [None]:
s = pd.Series([1, 3, 5, np.nan, 'seks', 8])
print(s,'\n---------------------')
s = pd.Series(['seks','fem','fire'], index=[6,5,4])
print(s)
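
With an explicit index, a value can be looked up either by its label or by its position. A short sketch reusing the second `Series` from the cell above:

```python
import numpy as np
import pandas as pd

s = pd.Series(["seks", "fem", "fire"], index=[6, 5, 4])

print(s.loc[6])        # label-based lookup: the value stored under index label 6
print(s.iloc[0])       # position-based lookup: the first value
print(list(s.index))   # the axis labels, i.e. the index

# Mixing numbers and strings forces the generic object dtype.
mixed = pd.Series([1, 3, 5, np.nan, "seks", 8])
print(mixed.dtype)
```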

## `DataFrame`

Since `Series` are one-dimensional arrays, we have to create a `DataFrame` if we want to combine our two previous `Series` objects `ts_dk` and `ts_ur`.

A `DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or **a dict of Series objects**.

In the following we concatenate two `Series` to form a `DataFrame`.

We will use the pandas `concat()` function ([a good explanation is available here](https://www.tutorialspoint.com/python_pandas/python_pandas_concatenation.htm)).
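
A minimal sketch of that concatenation follows. The `Series` `ts_dk` and `ts_ur` are not defined in this notebook, so the ones below are hypothetical stand-ins with invented values and index labels.

```python
import pandas as pd

# Hypothetical stand-ins for ts_dk and ts_ur; all values are invented.
ts_dk = pd.Series([1.0, 2.0, 3.0], index=[2000, 2001, 2002], name="dk")
ts_ur = pd.Series([4.0, 5.0], index=[2001, 2002], name="ur")

# axis=1 places each Series in its own column and aligns them on the index.
# Index labels missing from one Series are filled with NaN.
combined = pd.concat([ts_dk, ts_ur], axis=1)
print(combined)

# The "dict of Series" view of a DataFrame gives the same layout.
from_dict = pd.DataFrame({"dk": ts_dk, "ur": ts_ur})
print(from_dict)
```

Without `axis=1`, `concat()` stacks the two `Series` on top of each other and returns a longer `Series` instead of a `DataFrame`.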

In [None]:
# World Bank dataset on CO2 emissions
# See http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.KT?downloadformat=csv
df = pd.read_csv('API_EN.ATM.CO2E.KT_DS2_en_csv_v2_10473877.csv')

In [None]:
df

In [None]:
df['Country Name'] # Pandas column style

In [None]:
df.loc[0]  # Selects the row whose index label is 0

In [None]:
df.iloc[0, 0]  # Gets the first element of the first row, by position
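
The difference between label-based `loc` and position-based `iloc` only shows up when the index is not the default 0, 1, 2, ... A small sketch on an invented frame (only the column name `Country Name` comes from the notebook; the index labels and numbers are made up):

```python
import pandas as pd

# Invented stand-in with a non-default index so labels and positions differ.
demo = pd.DataFrame(
    {"Country Name": ["Aruba", "Denmark", "Ukraine"], "1990": [1.0, 2.0, 3.0]},
    index=[10, 20, 30],
)

print(demo.loc[10])                   # row with index label 10
print(demo.iloc[0])                   # first row by position (the same row here)
print(demo.iloc[0, 0])                # first row, first column, by position
print(demo.loc[20, "Country Name"])   # row label 20, column name
```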

### Pandas exercise

* Download the `befkbhalderstatkode.csv` dataset from either GitHub or using `wget`
* Load it into Pandas with `read_csv()`
* Find the second element in the third column

## Working with missing data


In [None]:
vals = np.array([1, np.nan, 3, 4]) 
vals.dtype
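
`np.nan` is a floating-point value, which is why the array above ends up with a float dtype. It also propagates through arithmetic, as this short sketch shows:

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

print(vals.dtype)        # float64: NaN is a float, so the integers are upcast
print(vals.sum())        # nan: any arithmetic involving NaN yields NaN
print(np.nansum(vals))   # NaN-aware sum that skips the missing value
print(np.isnan(vals))    # boolean mask marking the missing entries
```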

In [None]:
df[df['Country Name'] == 'Ukraine']

In [None]:
part = df[df['Country Name'] == 'Ukraine']

In [None]:
part

In [None]:
part.isna()

In [None]:
part.dropna()

In [None]:
part.dropna?

In [None]:
part.dropna(axis=1)


In [None]:
part.fillna(0)
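
The effect of `axis` and `how` in `dropna()` is easier to see on a tiny frame. The frame below is invented for illustration and is unrelated to the World Bank data:

```python
import numpy as np
import pandas as pd

# Invented frame: x has one gap, y is entirely missing, z is complete.
demo = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, np.nan],
    "z": [7.0, 8.0, 9.0],
})

print(demo.isna().sum())                  # number of missing values per column
print(demo.dropna())                      # drops every row containing a NaN
print(demo.dropna(axis=1))                # drops every column containing a NaN
print(demo.dropna(axis=1, how="all"))     # drops only columns that are all NaN
print(demo.fillna(0))                     # replaces each NaN with 0
```

Because column `y` is missing everywhere, the default `dropna()` leaves no rows at all here, which is the same reason a row-wise `dropna()` can return an empty result.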

## Using pandas for visualisation

Let's plot CO2 emissions (y-axis) against years (x-axis).

In [None]:
ukraine = df[df['Country Name'] == 'Ukraine']

In [None]:
ukraine.iloc[0]

In [None]:
ukraine = ukraine.iloc[0][4:]
ukraine

In [None]:
ukraine.plot()

In [None]:
ukraine[ukraine > 0].plot()

In [None]:
denmark = df[df['Country Name'] == 'Denmark'].iloc[0][4:]

In [None]:
denmark = denmark.dropna()
denmark

In [None]:
denmark.plot()

In [None]:
ukraine.plot()
denmark.plot()
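
The last cell draws both lines on the same axes but without a legend. One possible way to label them, sketched here with invented numbers in place of the real emission series, is to put both `Series` into a `DataFrame` so that the column names become the legend entries:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this also runs outside a notebook
import pandas as pd

# Invented values standing in for the real ukraine and denmark Series.
years = ["1992", "1993", "1994"]
ukraine = pd.Series([3.0, 2.0, 1.5], index=years)
denmark = pd.Series([1.0, 1.1, 0.9], index=years)

# Each column becomes one line; the column name is used as its legend label.
both = pd.DataFrame({"Ukraine": ukraine, "Denmark": denmark})
ax = both.plot()
ax.set_xlabel("Year")
ax.set_ylabel("CO2 emissions (kt)")
```

Inside the notebook the `matplotlib.use("Agg")` line is not needed, since the plots are already rendered inline.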