# Pandas Foundations 1

<font size="3"> 

- Quick recap
- Jupyter Server
- Subsetting 
- Introduction to Pandas (1)
  - Grammatical verbs
- Q&A
    
    
</font> 

## Dictionaries

- A dictionary is a collection of items
- Each item in a dictionary is a key/value pair
- Dictionaries are used to store data values
- As of Python 3.7, dictionaries are ordered: they preserve insertion order. In earlier versions, the order was not guaranteed.
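
The ordering and uniqueness points above can be sketched in a few lines. The dictionary names `capitals` and `duplicated` are made up for this illustration.

```python
# Keys keep their insertion order (guaranteed since Python 3.7).
capitals = {"spain": "madrid", "france": "paris"}
capitals["norway"] = "oslo"
key_order = list(capitals)

# Keys are unique: repeating a key in a literal keeps only the last value.
duplicated = {"germany": "bonn", "germany": "berlin"}

print(key_order)
print(duplicated)
```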

In [None]:
pop = [30.55, 2.77, 39.21]

countries = ["afghanistan", "albania", "algeria"]

In [None]:
type(countries)

In [None]:
# -- A dictionary is a collection that is ordered (Python 3.7+), changeable, and does not allow duplicate keys.

world = {"afghanistan":30.55, "albania":2.77, "algeria":39.21}

world

In [None]:
type(world)

In [None]:
world["albania"]

### Access dictionary

If the keys of a dictionary are chosen wisely, accessing its values is easy and intuitive. For example, to get the capital of France from europe, you can use:

europe['france']

Here, 'france' is the key, and the value 'paris' is returned.
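
Square-bracket access raises a KeyError when the key is missing. A minimal sketch of two common ways to guard against that, using a small stand-in dictionary (`europe_small` is a made-up name):

```python
europe_small = {'spain': 'madrid', 'france': 'paris'}

# Membership test before access.
has_italy = 'italy' in europe_small

# .get() returns a default instead of raising (None if no default is given).
italy_capital = europe_small.get('italy', 'unknown')

# Plain indexing on a missing key raises KeyError.
try:
    europe_small['italy']
    raised = False
except KeyError:
    raised = True

print(has_italy, italy_capital, raised)
```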

In [None]:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }

# Print out the keys in europe
print(europe.keys())


In [None]:
# Print out value that belongs to key 'spain'
print(europe['spain'])

### Dictionary Manipulation (1)

- Add

If you know how to access a dictionary, you can also assign a new value to it. To add a new key-value pair to europe you can use something like this:

europe['iceland'] = 'reykjavik'

In [None]:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }

europe

In [None]:
# Add italy to europe
europe['italy'] = 'rome'

In [None]:
# Print out italy in europe
print('italy' in europe)


In [None]:
# Add poland to europe
europe['poland'] = 'warsaw'

In [None]:
# Print europe
print(europe)

### Dictionary Manipulation (2)

- Update
- Delete
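
Besides item assignment and del, the built-in dict methods update() and pop() cover the same two operations. A small sketch on a made-up dictionary named `capitals`:

```python
capitals = {'germany': 'bonn', 'australia': 'vienna'}

# update() overwrites existing keys and adds new ones in one call.
capitals.update({'germany': 'berlin', 'austria': 'vienna'})

# pop() removes a key and returns its value.
removed = capitals.pop('australia')

# Passing a default to pop() avoids a KeyError for a missing key.
missing = capitals.pop('atlantis', None)

print(capitals, removed, missing)
```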

In [None]:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw',
          'australia':'vienna' }

print(europe)



In [None]:
# Update capital of germany
europe['germany'] = 'berlin'

europe

In [None]:
# Remove australia
del europe['australia']


In [None]:
# Print europe
print(europe)

### Dictionary Manipulation (3)

- Dictionaries can contain key:value pairs where the values are again dictionaries.

As an example, have a look at the cell below, where another version of europe, the dictionary you've been working with all along, is defined. The keys are still the country names, but the values are dictionaries that contain more information than just the capital.

It's perfectly possible to chain square brackets to select elements. To fetch the population for Spain from europe, for example, you need:

europe['spain']['population']
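
Chained brackets combine naturally with looping over items(). The sketch below uses a shortened, made-up copy of the nested dictionary (`europe_info`) to pull one field out of every sub-dictionary:

```python
europe_info = {
    'spain': {'capital': 'madrid', 'population': 46.77},
    'france': {'capital': 'paris', 'population': 66.03},
}

# Chained lookup: outer key first, then inner key.
spain_population = europe_info['spain']['population']

# Collect the population of every country into a flat dictionary.
populations = {country: info['population']
               for country, info in europe_info.items()}

print(spain_population)
print(populations)
```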

In [None]:
# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }


europe

In [None]:
# Print out the capital of France
print(europe['france']['capital'])

In [None]:
# Create sub-dictionary data
data = { 'capital':'rome', 'population':59.83 }

data

In [None]:
# Add data to europe under key 'italy'
europe['italy'] = data


In [None]:
# Print europe
europe

## Pandas

What is pandas?

- Python library for data analysis
    - <https://pandas.pydata.org/getting_started.html>
    - <https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf>
    
- Built on Numpy

- DataFrame
    - Tidy data 
    - <https://www.jstatsoft.org/article/view/v059i10>

- High-performance containers for data analysis

- Data structures with a lot of functionality
    - Meaningful labels
    - Time series functionality
    - Handling missing data
    - Relational operations
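
Two of the points above, meaningful labels and missing-data handling, can be sketched on a tiny made-up table. The name `frame` and the values are for illustration only.

```python
import pandas as pd

# Row labels are country names; one population value is deliberately missing.
frame = pd.DataFrame(
    {'capital': ['madrid', 'oslo'], 'population': [46.77, None]},
    index=['spain', 'norway'],
)

# Label-based lookup by row and column name.
spain_capital = frame.loc['spain', 'capital']

# Missing values are stored as NaN and skipped by default in aggregations.
missing_count = int(frame['population'].isna().sum())
mean_population = frame['population'].mean()

print(frame)
print(spain_capital, missing_count, mean_population)
```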

### Dictionary to DataFrame (1)

In [None]:
# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

type(names)

In [None]:
# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }

type(my_dict)

In [None]:
my_dict

In [None]:
# Import pandas as pd
import pandas as pd

# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

# Print cars
print(cars)
type(cars)

### Dataframe

![title](imgs/dataframe.png)

### CSV to DataFrame (1)

- Putting data in a dictionary and then building a DataFrame works, but it's not very efficient. What if you're dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. 

- One of those file types is the CSV file, which is short for "comma-separated values".

- To import CSV data into Python as a Pandas DataFrame you can use read_csv().


In [None]:
# --Import pandas as pd
#import pandas as pd

# Import the cars.csv data: cars
cars = pd.read_csv('data/cars.csv')

# Print out cars
print(cars)

### CSV to DataFrame (2)

- Your read_csv() call to import the CSV data didn't generate an error, but the output is not entirely what we wanted. 

- The row labels were imported as another column without a name.
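
The effect can be reproduced without the data file. The sketch below reads a small in-memory CSV whose first header field is empty; the contents are made up to resemble data/cars.csv and are not a copy of it.

```python
import io
import pandas as pd

# Stand-in CSV text: the first header field is empty because the
# row labels have no column name.
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JPN,588,Japan,False
"""

# Default import: the labels become an ordinary column named 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as row labels instead.
labelled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(labelled.index.tolist())
```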


In [None]:
# Fix import by including index_col
cars = pd.read_csv('data/cars.csv', index_col = 0)

# Print out cars
print(cars)

### Square Brackets (1)

- To select only the cars_per_cap column from cars, you can use either of the following:

    - cars['cars_per_cap']
        - <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html>

    - cars[['cars_per_cap']]
        - <https://pandas.pydata.org/pandas-docs/stable/reference/frame.html>

- The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.
        

In [None]:
# Print out country column as Pandas Series
print(cars['country'])


In [None]:
type(cars['country'])

In [None]:
# Print out country column as Pandas DataFrame
print(cars[['country']])


In [None]:
type(cars[['country']])

In [None]:
# Print out DataFrame with country and drives_right columns
print(cars[['country', 'drives_right']])

### Square Brackets (2)

- Square brackets can do more than just selecting columns. You can also use them to get rows, or observations, from a DataFrame. 

- The following call selects the first five rows from the cars DataFrame:

    - cars[0:5]

- The result is another DataFrame containing only the rows you specified.


In [None]:
# Print out first 3 observations
print(cars[0:3])


In [None]:
# Print out fourth, fifth and sixth observation
print(cars[3:6])

### loc and iloc (1)

- With loc and iloc you can do practically any data selection operation on DataFrames you can think of. 

- **loc** is label-based, which means that you have to specify rows and columns based on their row and column labels. 

    - <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html>

- **iloc** is integer-position based, so you have to specify rows and columns by their integer position (starting at 0), as in the row slices above.

    - <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html>


cars.loc['RU']   
cars.iloc[4]   

cars.loc[['RU']]   
cars.iloc[[4]]   

cars.loc[['RU', 'AUS']]   
cars.iloc[[4, 1]]   
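
The pairs above select the same data in two ways. A self-contained sketch on a made-up frame (`cars_demo`, with row labels chosen to mirror the ones used in this section) checks that equivalence:

```python
import pandas as pd

cars_demo = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200],
     'country': ['United States', 'Australia', 'Japan', 'India', 'Russia']},
    index=['US', 'AUS', 'JPN', 'IN', 'RU'],
)

# Single label or position: the row comes back as a Series.
row_by_label = cars_demo.loc['RU']
row_by_position = cars_demo.iloc[4]

# A list of labels or positions: the result is a DataFrame,
# with rows in the order requested.
frame_by_label = cars_demo.loc[['RU', 'AUS']]
frame_by_position = cars_demo.iloc[[4, 1]]

print(row_by_label.equals(row_by_position))
print(frame_by_label.equals(frame_by_position))
```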



In [None]:
#import pandas as pd

cars = pd.read_csv('data/cars.csv', index_col = 0)

cars

In [None]:
# Print out observations for Australia and Egypt
print(cars.loc[['AUS', 'EG']])

In [None]:
# Print out observation for Japan
print(cars.iloc[2])


### loc and iloc (2)

- loc and iloc also allow you to select both rows and columns from a DataFrame. 


In [None]:
cars

In [None]:
cars.loc['IN', 'cars_per_cap']

In [None]:
cars.iloc[3, 0]

In [None]:
cars.loc[['IN', 'RU'], 'cars_per_cap']

In [None]:
cars.iloc[[3, 4], 0]

In [None]:
cars.loc[['IN', 'RU'], ['cars_per_cap', 'country']]

In [None]:
cars.iloc[[3, 4], [0, 1]]

### loc and iloc (3)

- It's also possible to select only columns with loc and iloc. 

- In both cases, you simply put a slice going from beginning to end (a bare colon) in front of the comma:
  

In [None]:
cars


In [None]:
cars.loc[:, 'country'] 

In [None]:
cars.iloc[:, 1]

In [None]:
cars.loc[:, ['country','drives_right']]  

In [None]:
cars.iloc[:, [1, 2]] 

### Methods

- head()
- tail()
- info()
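
A quick sketch of what these three methods return, on a small made-up frame (`prices`) so it runs without the data file:

```python
import pandas as pd

prices = pd.DataFrame({
    'day': list(range(1, 11)),
    'close': [float(x) for x in range(100, 110)],
})

# head() and tail() return the first/last 5 rows unless a count is given.
first_rows = prices.head()
last_rows = prices.tail(3)

# info() prints column names, non-null counts and dtypes.
prices.info()

print(first_rows)
print(last_rows)
```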


In [None]:
apple_stock = pd.read_csv("data/appl_1980_2014.csv")


In [None]:
apple_stock.head()

In [None]:
apple_stock.tail()

In [None]:
apple_stock.info()

## Use Case:

### Datasets from CSV files

- CSV file has no column headers

- Missing values

- Domain knowledge

- <http://www.sidc.be/silso/home>
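
The first two issues in the list can be reproduced on a tiny in-memory file. The rows below are made up to imitate the layout described in this use case (no header row, missing counts written as ' -1' with a leading space); they are not real SILSO data, and the names `demo` and `demo_names` are illustrative.

```python
import io
import pandas as pd

raw_text = """1818,01,01,1818.001, -1,1
1818,01,02,1818.004, -1,1
1818,01,08,1818.021, 65,1
"""

demo_names = ['year', 'month', 'day', 'dec_date', 'sunspots', 'definite']

# header=None: do not treat the first data row as column names.
# names=...: supply the column names ourselves.
# na_values=...: treat ' -1' in the sunspots column as missing.
demo = pd.read_csv(
    io.StringIO(raw_text),
    header=None,
    names=demo_names,
    na_values={'sunspots': [' -1']},
)

print(demo)
```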

In [None]:
filepath = 'data/ISSN_D_tot.csv'

In [None]:
sunspots = pd.read_csv(filepath)

#sunspots = pd.read_csv("data/ISSN_D_tot.csv")

sunspots

In [None]:
sunspots.info()

In [None]:
sunspots.iloc[10:20, :]

### Using header keyword

In [None]:
sunspots = pd.read_csv(filepath, header=None)

sunspots

In [None]:
sunspots.iloc[10:20, :]

### Using names keyword

In [None]:
col_names = ['year', 'month', 'day', 'dec_date', 'sunspots', 'definite']

col_names

In [None]:
type(col_names)

In [None]:
sunspots = pd.read_csv(filepath, header=None, names=col_names)

sunspots 

In [None]:
sunspots.iloc[10:20, :]

### Using na_values keyword (1)

- read_csv() 
    - <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html>

In [None]:
sunspots = pd.read_csv(filepath, header=None,
                       names=col_names, na_values='-1')

sunspots

In [None]:
sunspots.iloc[10:20, :]

In [None]:
# -- Note the leading space: the file stores missing values as ' -1', not '-1'

sunspots = pd.read_csv(filepath, header=None,
                      names=col_names, na_values=' -1')

sunspots

### Using na_values keyword (2)

In [None]:
sunspots = pd.read_csv(filepath, header=None,
                       names=col_names, na_values={'sunspots':[' -1']})

sunspots

In [None]:
sunspots.info()

### Using parse_dates keyword

- read_csv() 
    - <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html>
  

In [None]:
sunspots = pd.read_csv(filepath,
                       header=None,
                       names=col_names,
                       na_values={'sunspots':[' -1']},
                       parse_dates=[[0, 1, 2]])

sunspots.iloc[10:20, :]

In [None]:
sunspots.info()
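
Combining several columns through a nested list in parse_dates is, to my knowledge, deprecated in recent pandas releases (2.2 and later). An alternative that should work across versions is to read the columns as numbers and assemble the date afterwards with pd.to_datetime. The sketch below uses made-up rows and the illustrative names `demo` and `demo_names`.

```python
import io
import pandas as pd

raw_text = """1818,01,01,1818.001, -1,1
1818,01,02,1818.004, 65,1
"""

demo_names = ['year', 'month', 'day', 'dec_date', 'sunspots', 'definite']

demo = pd.read_csv(
    io.StringIO(raw_text),
    header=None,
    names=demo_names,
    na_values={'sunspots': [' -1']},
)

# pd.to_datetime assembles dates from columns named year, month and day.
demo['date'] = pd.to_datetime(demo[['year', 'month', 'day']])

# The three source columns are no longer needed once the date exists.
demo = demo.drop(columns=['year', 'month', 'day'])

print(demo)
```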

### Writing files

- to_csv()

    - <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html>
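
A minimal round-trip sketch: to_csv() also accepts a file-like object, so an in-memory buffer stands in for a file on disk here. The frame and its labels are made up for illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame(
    {'sunspots': [65.0, None]},
    index=pd.Index(['a', 'b'], name='label'),
)

# By default the index is written as the first column.
buffer = io.StringIO()
frame.to_csv(buffer)

# Reading it back with index_col=0 restores the row labels.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# index=False leaves the row labels out; with no path, to_csv returns a string.
without_index = frame.to_csv(index=False)

print(restored)
print(without_index)
```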
    

In [None]:
#sunspots.to_csv("data/ISSN_D_tot_CLEAN_5.csv")