# Class 6: Introduction to Pandas pt.2

In this Jupyter Notebook you will learn about the basic workings of Pandas Data Frame structures. Please work through this document's Python-3 code cells to experience the power of the Pandas library.

Pandas is a standard data science library for Python-3. Pandas is built on top of the NumPy library, so its data structures should be quick to pick up if you have already worked with NumPy arrays. You can read more about Pandas (Data Frames) @ the Pandas online docs: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

In [119]:
import pandas as pd

___
## Pandas Data Frames

A **Data Frame** is a two-dimensional Pandas structure used to store related data points in a _column-x-row_ fashion. Here are a few details about Pandas data frames:
* Data frames can be created from many data sources.
   * Hard coded
   * CSV file
   * Excel file
   * Online database
* Data frames have a wide range of meta-data which can be accessed via easy to use methods.
* Data frames can be 'indexed' in a number of ways by column or row key values.
* Data frames can be sliced by passing conditional statements into the indexing features.
* Data frames can be modified using code to add or remove values.

___
## Section 1:
### Creating Data Frames

Here are a few ways to create a Pandas data frame (df):
1. Hard coding: writing out each value cell.
2. Importing from a CSV file.
3. Importing from an Excel file.

Additionally, this section covers how to add custom index values by:
1. Hard coding index values
2. Converting a column into index values

#### 1. Hard coding: writing out each value cell.

Data frames can be created by calling the **pd.DataFrame()** constructor, which takes a data structure. In this example, we pass a raw dictionary into the constructor: each key becomes a column name and each list becomes that column's values.

You can read more about this method @: 

In [9]:
G7_df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

In [None]:
G7_df

Notice that the above data frame does not have custom labels for its index. Right now, in order to access a row of data, we would use the numerical position of the row. Below you will see how to add custom index values to the "G7_df" structure. Once the cell below has been executed, return to the above cell and rerun it to see the changes!

In [7]:
G7_df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]
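The two steps above (build the frame, then assign `.index`) can also be combined, because `pd.DataFrame()` accepts an `index` argument. Here is a minimal sketch; `small_df` is a hypothetical, shortened copy of the G7 data used only for illustration:

```python
import pandas as pd

# Shortened, illustrative copy of the G7 data (two rows, two columns).
small_df = pd.DataFrame(
    {
        'Population': [35.467, 63.951],
        'HDI': [0.913, 0.888],
    },
    index=['Canada', 'France'],  # row labels supplied at construction time
)

print(small_df.index.tolist())        # ['Canada', 'France']
print(small_df.loc['France', 'HDI'])  # 0.888
```

The list passed to `index` must have the same length as the columns, just as when assigning to `.index` afterwards.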

#### 2. Importing from a CSV file.

Data frames can be created by _reading_ a Comma-Separated-Value (CSV) file using the **pd.read_csv()** function. A CSV file stores data in the same row-and-column layout as a data frame, so it is quite simple to import.

_Notice_: When Pandas reads a CSV file, the first row is treated as the column names by default.

You can read more about this method @ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv

In [49]:
G7_df = pd.read_csv("G7_data.csv")

In [None]:
G7_df
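To see the header behaviour without needing a file on disk, the sketch below reads CSV text from memory. The text in `csv_text` is an assumed stand-in; the real `G7_data.csv` may have different columns.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (assumed layout, for illustration only).
csv_text = """Name,Population,Continent
Canada,35.467,America
France,63.951,Europe
"""

# By default the first line is used for the column names.
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # ['Name', 'Population', 'Continent']

# index_col turns one column into the row index while reading.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col='Name')
print(df_indexed.index.tolist())  # ['Canada', 'France']
```

Using `index_col` here gives the same result as reading the file first and then calling `set_index("Name")`.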

#### 3. Importing from an Excel file.

Data frames can be created by _reading_ an Excel (xlsx) file using the **pd.read_excel()** function. Reading xlsx files requires an Excel engine such as openpyxl to be installed.

_Notice_: When Pandas reads an Excel file, the first row is treated as the column names by default.

You can read more about this method @ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

In [51]:
G7_df = pd.read_excel("G7_data.xlsx")

In [None]:
G7_df

#### 1. Hard coding index values
This process has already been seen at the top of section 1.

In [56]:
G7_df = pd.read_csv("G7_data.csv")

G7_df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

#### 2. Convert Column into index values
This process is completed by using the **df.set_index()** method.

_Notice_: this method is attached to a data frame object (G7_df), not the pd library!

In [154]:
G7_df = pd.read_csv("g7_data_columnNames.csv")

G7_df = G7_df.set_index("Name")

G7_df

|Name|Population|GDP|Surface Area|HDI|Continent|
|:----------|:-----|:-----|:-----|:-----|:-----|
|Canada|35.467|1785387|9984670|0.913|America|
|France|63.951|2833687|640679|0.888|Europe|
|Germany|80.94|3874437|357114|0.916|Europe|
|Italy|60.665|2167744|301336|0.873|Europe|
|Japan|127.061|4602367|377930|0.891|Asia|
|United Kingdom|64.511|2950039|242495|0.907|Europe|
|United States|318.523|17348075|9525067|0.915|America|


___
## Section 2:
### Getting Information About Data Frames

Data frames are packed with information! Using some built-in methods and attributes, you can quickly access information that will help you make better decisions about processing. Here are the ones you will practice:

|Method/Attribute|Desc.|
|:----------|:-----|
|df.columns|Outputs the names of the columns in order|
|df.index|Outputs the index labels in order|
|df.info()|Outputs the columns, number of non-null values, data types|
|df.size|Total number of values (rows x columns)|
|df.shape|(#-rows, #-columns)|
|df.head()|Output the first 5 rows by default; df.head(n) outputs the first n|
|df.tail()|Output the last 5 rows by default; df.tail(n) outputs the last n|
|df.dtypes|Output the data types of each column|
|df.dtypes.value_counts()|Output the number of each data type|
|df.describe()|Output basic statistical information on columns with numerical data|

In [None]:
G7_df.columns

In [None]:
G7_df.index

In [None]:
# output the index labels using a for-loop
for index in G7_df.index:
    print(index)

In [None]:
G7_df.info()

In [None]:
G7_df.size

In [None]:
G7_df.shape

In [None]:
G7_df.head()

In [None]:
G7_df.tail()

In [None]:
G7_df.describe()

In [None]:
G7_df.dtypes

In [None]:
G7_df.dtypes.value_counts()
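The relationships between a few of these attributes can be checked directly. The sketch below uses `demo_df`, a hypothetical two-column cut of the G7 data, so it runs without any external file:

```python
import pandas as pd

# Illustrative two-column subset of the G7 data (7 rows).
demo_df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America'],
})

rows, cols = demo_df.shape
print(rows, cols)                   # 7 2
print(demo_df.size == rows * cols)  # True

# head() and tail() default to 5 rows; pass n to change that.
print(len(demo_df.head()))   # 5
print(len(demo_df.head(3)))  # 3
print(len(demo_df.tail(2)))  # 2
```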

___
## Section 3:
### Indexing data frames

Similar to the Pandas series, data frames can be indexed in a number of ways. Indexing and slicing are very important processes in data science. Sometimes the data set your program is working with is too big, or contains many columns/rows that are not necessary to the overall goal (at the moment). Knowing how to index is very important, so please make sure to practice these concepts! In this section you will see how to index our data frame (G7_df) in the following ways:


|Index Style|Desc.|
|:----------|:-----|
|df.iloc[]|Indexing row based on the numerical position|
|df.iloc[1:4]|Indexing a slice of rows|
|df.iloc[1,2]|Indexing an explicit cell|
|df.iloc[1, [0,-1]]|Indexing one row, keeping only the selected column(s)|
|df.iloc[1:3, [0,-1]]|Indexing a slice of rows, keeping only the selected column(s)|
|df.iloc[1:3, 0]|Indexing a slice of rows and a single column; returned as a series (unformatted)|
|df.iloc[1:3, 0:1]|Indexing a slice of rows and a slice of columns; returned as a data frame|
|df["column"]|Index the column based on its custom name|
|df.loc["key-value"]|Index the row based on its custom name|
|df.loc["key" : "key"]|Index a slice of rows based on custom index names (both end points are included)|
|df.loc["key" : "key", "column"]|Index a slice of rows and one column based on custom names; returned as a series|
|df.loc["key" : "key", ["column", "column"]]|Index a slice of rows and a list of columns based on custom names|
|df['column'].to_frame()|Index a column and show it as a formatted table (data frame)|

* **iloc[]** is "Purely integer-location based indexing for selection by position." (docs)
   * Learn more about the attributes and features of **iloc** by visiting the docs @ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
* **loc[]** is "Access a group of rows and columns by label(s) or a boolean array." (docs)
   * Learn more about the attributes and features of **loc** by visiting the docs @ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

_Notice_: The term 'key' is a row's index label (the country names for G7_df).

_Notice_: the **indexing** styles return a new object (a data frame, a series, or a single value); in order to 'save' the result, you will need to store it in a variable.

In [None]:
G7_df.iloc[0]

In [None]:
G7_df.iloc[1:4]

In [None]:
G7_df.iloc[1,2]

In [None]:
G7_df.iloc[1, [0,-1]]

In [None]:
G7_df.iloc[1:3, [0,-1]]

In [None]:
G7_df.iloc[1:3, 0]

In [None]:
G7_df.iloc[1:3, 0:1]

In [None]:
G7_df["Population"]

In [None]:
G7_df.loc["United States"]

In [None]:
G7_df.loc["Canada" : "Japan"]

In [None]:
G7_df.loc["Germany" : "Japan", "HDI"]

In [None]:
G7_df.loc["Germany" : "Japan", ["GDP", "HDI"]]

In [None]:
G7_df['Population'].to_frame()
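One detail that is easy to miss in the cells above is that position slices and label slices treat their end point differently. The following sketch, built on a hypothetical four-row `demo_df`, shows the difference along with the series-versus-data-frame distinction:

```python
import pandas as pd

# Illustrative four-row frame with country names as the index.
demo_df = pd.DataFrame(
    {'HDI': [0.913, 0.888, 0.916, 0.873]},
    index=['Canada', 'France', 'Germany', 'Italy'],
)

# iloc slices by position: the stop position (3) is excluded.
by_position = demo_df.iloc[1:3]
print(by_position.index.tolist())  # ['France', 'Germany']

# loc slices by label: the stop label ('Germany') is included.
by_label = demo_df.loc['France':'Germany']
print(by_label.index.tolist())     # ['France', 'Germany']

# A single column name returns a Series; a list of names returns a DataFrame.
print(type(demo_df['HDI']).__name__)    # Series
print(type(demo_df[['HDI']]).__name__)  # DataFrame
```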

___
## Section 4:
### Conditional Indexing

Similar to the Pandas series, data frames can be indexed by passing conditional statements inside the brackets. The condition is evaluated for every row and only the rows where it is true are kept. This is a very powerful tool used to isolate subsets of data:

|Index Style|Desc.|
|:----------|:-----|
|df["column"] > val|Return a True/False value for each row showing whether it meets the condition (no data)|
|df.loc[df['column'] > 70]|Return the rows that meet the condition (with all columns)|
|df.loc[df['column'] > 70, 'column']|Return the rows that meet the condition (with one specified column)|
|df.loc[df['column'] > 70, ['column', 'column']]|Return the rows that meet the condition (with several specified columns)|

_Notice_: the **conditional indexing** styles return a new data frame or series; in order to 'save' the result, you will need to store it in a variable.

In [None]:
G7_df["Population"] > 70

In [None]:
G7_df.loc[G7_df['Population'] > 70]

In [None]:
G7_df.loc[G7_df['Population'] > 70, 'GDP']

In [None]:
G7_df.loc[G7_df['Population'] > 70, ['Population', 'GDP']]
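Conditions can also be combined. The sketch below uses a hypothetical five-row `demo_df` and the element-wise operators `&` (and), `|` (or) and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons:

```python
import pandas as pd

# Illustrative subset of the G7 data.
demo_df = pd.DataFrame(
    {
        'Population': [35.467, 63.951, 80.94, 127.061, 318.523],
        'Continent': ['America', 'Europe', 'Europe', 'Asia', 'America'],
    },
    index=['Canada', 'France', 'Germany', 'Japan', 'United States'],
)

# Both conditions must hold.
large_european = demo_df.loc[
    (demo_df['Population'] > 70) & (demo_df['Continent'] == 'Europe')
]
print(large_european.index.tolist())  # ['Germany']

# Either condition may hold.
small_or_asian = demo_df.loc[
    (demo_df['Population'] < 50) | (demo_df['Continent'] == 'Asia')
]
print(small_or_asian.index.tolist())  # ['Canada', 'Japan']

# Negate a condition.
not_europe = demo_df.loc[~(demo_df['Continent'] == 'Europe')]
print(not_europe.index.tolist())  # ['Canada', 'Japan', 'United States']
```

Python's `and`, `or` and `not` keywords do not work here, since they expect a single True/False value rather than one value per row.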

___
## Section 5:
### Removing rows/columns using df.drop()

Rows and columns can be removed from a data frame with the **df.drop()** method. Rows are identified by their index label ('key') and columns by their name; the **axis** or **columns** argument tells Pandas which of the two you mean:

|Call|Desc.|
|:----------|:-----|
|df.drop('key')|Remove a row with specific index|
|df.drop(['key', 'key'])|Remove a specified set of rows|
|df.drop(columns=['column', 'column'])|Remove a specific set of columns|
|df.drop(['key', 'key'], axis=0)|Removes a specific set of rows (when axis=0)|
|df.drop(['column', 'column'], axis=1)|Removes a specific set of columns (when axis=1)|
|df.drop(['column', 'column'], axis='columns')|Removes a specific set of columns (when axis='columns')|
|df.drop(['key', 'key'], axis='rows')|Removes a specific set of rows (when axis='rows')|

_Notice_: the **df.drop()** method returns a new data frame and leaves the original unchanged; in order to 'save' the new data frame, you will need to store it in a variable.

In [None]:
G7_df.drop('France')

In [None]:
G7_df.drop(['Italy', 'Canada'])

In [None]:
G7_df.drop(columns=['Surface Area', 'Continent'])

In [None]:
G7_df.drop(['United Kingdom', 'United States'], axis=0)

In [None]:
G7_df.drop(['Population', 'GDP'], axis=1)

In [None]:
G7_df.drop(['Population', 'GDP'], axis='columns')

In [None]:
G7_df.drop(['United Kingdom', 'United States'], axis='rows')
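The notice above can be verified with a short sketch. `demo_df` is a hypothetical two-row frame, and `'Spain'` is a label chosen because it does not exist in it:

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'Population': [35.467, 63.951], 'GDP': [1785387, 2833687]},
    index=['Canada', 'France'],
)

# The result is discarded here, so demo_df still has both rows.
demo_df.drop('France')
print(demo_df.index.tolist())  # ['Canada', 'France']

# Store the result to keep the change.
trimmed = demo_df.drop('France')
print(trimmed.index.tolist())  # ['Canada']

# Dropping a missing label raises KeyError unless errors='ignore' is passed.
unchanged = demo_df.drop('Spain', errors='ignore')
print(unchanged.index.tolist())  # ['Canada', 'France']
```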

___
## Section 6:
### Data Frame Operations

Using the basics of indexing, data frames can execute arithmetic operations on one or more columns at a time. 

In [None]:
# For demo output only
G7_df[['Population', 'GDP']]

In [None]:
G7_df[['Population', 'GDP']] / 100

Data frames and series can be combined using arithmetic operations. Pandas aligns the index labels of the series with the column names of the data frame, then applies the operation to every row.

In [None]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
crisis

In [None]:
G7_df[['GDP', 'HDI']] + crisis
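The alignment can be seen more clearly on a small example. In this sketch `demo_df` and `partial` are hypothetical names; `partial` deliberately omits the `HDI` label to show what happens when a column has no matching entry in the series:

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'GDP': [1785387, 2833687], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

# Series labels match the column names, so each row gets the same adjustment.
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
result = demo_df + crisis
print(result.loc['Canada', 'GDP'])  # 785387.0

# A column with no matching label in the series becomes NaN.
partial = pd.Series([-1_000_000], index=['GDP'])
result_partial = demo_df + partial
print(result_partial['HDI'].isna().all())  # True
```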

___
## Section 7:
### Modifying Data Frames

In this section you will see the following ways to modify a Pandas data frame:
* New columns can be created by including simple Pandas series.
* New rows can be added with **pd.concat()** or by assigning to **df.loc['key']** (the older **df.append()** method was removed in pandas 2.0).
* Using the basics of indexing, data frames can execute arithmetic operations on one or more columns at a time.
* Basic statistical analysis can be conducted on data frame columns.

In [141]:
langs = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language'
)

In [None]:
langs

In [None]:
G7_df['Language'] = langs

In [None]:
G7_df
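When a series is assigned as a new column, Pandas matches values by index label rather than by position. A minimal sketch with a hypothetical three-row `demo_df`:

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94]},
    index=['Canada', 'France', 'Germany'],
)

# The series has no entry for 'Canada'.
langs = pd.Series(
    ['French', 'German'],
    index=['France', 'Germany'],
    name='Language'
)

demo_df['Language'] = langs

# Rows without a matching label are filled with a missing value (NaN).
print(demo_df['Language'].isna().tolist())  # [True, False, False]
print(demo_df.loc['Germany', 'Language'])   # German
```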

In [None]:
# replacing all values in a column
G7_df['Language'] = 'English'

In [None]:
G7_df

In [None]:
# Renaming columns and index values
G7_df.rename(
    columns={
        'HDI': 'Human Development Index',
        'Anual Popcorn Consumption': 'APC'
    }, index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        'Argentina': 'AR'
    })

_Notice_ that in the above code, we have included the renaming of a column that does not exist! This is simply to demonstrate that Pandas recognizes this and, by default, handles it by ignoring it. The same is true for changing the index "Argentina" to "AR", which also does not exist.

In [None]:
# Making index uppercase
G7_df.rename(index=str.upper)

In [None]:
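# adding new rows (with index value)
# DataFrame.append() was removed in pandas 2.0; pd.concat() replaces it
china = pd.Series({
    'Population': 3,
    'GDP': 5
}, name="China")
pd.concat([G7_df, pd.DataFrame([china])])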

In [None]:
# Adding (or overwriting) a row using a series object; columns missing from the series become NaN
G7_df.loc['China'] = pd.Series({'Population': 1_400_000_000, 'Continent': 'Asia'})

In [None]:
G7_df

In [None]:
# moving the index labels back into a regular column (a default 0..n-1 index is restored)
G7_df.reset_index()
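`reset_index()` is the inverse of `set_index()`. The round trip can be sketched as follows, using a hypothetical `demo_df` with a `Name` column:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Name': ['Canada', 'France'],
    'HDI': [0.913, 0.888],
})

# Move the 'Name' column into the index.
indexed = demo_df.set_index('Name')
print(indexed.columns.tolist())   # ['HDI']

# Move the index back into a regular column.
restored = indexed.reset_index()
print(restored.columns.tolist())  # ['Name', 'HDI']
print(restored.index.tolist())    # [0, 1]
```

If the index has no name (as when it was hard coded with `df.index = [...]`), the restored column is called `index`.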

In [None]:
# Creating new column based on operations of existing columns
G7_df['GDP Per Capita'] = G7_df['GDP'] / G7_df['Population']

#### General statistical processing of a single column.

In [None]:
population = G7_df['Population']

In [None]:
population.min(), population.max()

In [None]:
population.sum()

In [None]:
population.sum() / len(population)

In [None]:
population.mean()

In [None]:
population.std()

In [None]:
population.median()

In [None]:
population.describe()

In [None]:
population.quantile(.25)

In [None]:
population.quantile([.2, .4, .6, .8, 1])
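Several of the statistics above are related to one another. The sketch below rebuilds the population column as a standalone series (values copied from the hard-coded data frame in Section 1) and checks a few of those relationships:

```python
import pandas as pd

# Population values from the hard-coded G7 data frame.
population = pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523]
)

# The median is the 0.5 quantile.
print(population.median())
print(population.quantile(0.5))

# The mean is the sum divided by the number of values.
manual_mean = population.sum() / len(population)
print(abs(manual_mean - population.mean()) < 1e-9)  # True

# describe() bundles count, mean, std, min, quartiles and max in one series.
summary = population.describe()
print(summary.index.tolist())
```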