# Lab 4 - Playing with Pandas

In this lab we will learn some of the basics of using Pandas data structures in Python. For a longer, more in-depth tutorial on Pandas, go to /mst688-data-science-applications/Basic Python Examples/D1_L5_Pandas/.

Here are a couple of links to additional notebooks with more examples.

[Pandas Objects](../Basic%20Python%20Examples/D1_L5_Pandas/01-Introducing-Pandas-Objects.ipynb)<br>
[Indexing and Selection](../Basic%20Python%20Examples/D1_L5_Pandas/02-Data-Indexing-and-Selection.ipynb)<br>


## Storing data in data structures

A Pandas Series is a one-dimensional array of indexed data, and a DataFrame is a two-dimensional table of indexed data whose rows and columns are both labeled. Both can be created either by entering the data directly or by reading it from an external file.

In [63]:
import pandas as pd

# Creating a series from a list of values and a list of labels (indices)
population_list = pd.Series([38332521, 26448193, 19651127, 19552860, 12882135],
                            index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])

print("Series from list: ")
print(population_list)
print(type(population_list))
print()

# Creating a series from a dictionary of key:value pairs
# (the keys become the index labels; note that area_dict itself is a Series, not a dict)
area_dict = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                       'Florida': 170312, 'Illinois': 149995})
print("Series from dictionary: ")
print(area_dict)
print(type(area_dict))
print()

# Creating a DataFrame from two series.
states_df = pd.DataFrame({'population': population_list,'area': area_dict})
print("DataFrame from Series:")
print(states_df)
print(type(states_df))
print()

# Creating a DataFrame from a CSV file, using the 'name' column as the index
states_csv = pd.read_csv('states.csv', index_col='name')
print("DataFrame from csv:")
print(states_csv)
print(type(states_csv))
print()

Series from list: 
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
<class 'pandas.core.series.Series'>

Series from dictionary: 
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64
<class 'pandas.core.series.Series'>

DataFrame from Series:
            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
<class 'pandas.core.frame.DataFrame'>

DataFrame from csv:
            population    area
name                          
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
<class 'pandas.core.frame.DataFrame'>
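
The file `states.csv` is not shown in this lab, so here is a minimal sketch of what it might contain, assuming its layout matches the printed DataFrame above (a `name` column plus `population` and `area`). The text in `csv_text` and the variable names are assumptions for illustration only. The sketch reads the text with `pd.read_csv` and compares the result with a DataFrame built from two Series.

```python
import io
import pandas as pd

# Hypothetical stand-in for states.csv: the real file is not shown in this lab,
# so this layout is inferred from the printed DataFrame.
csv_text = """name,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995
"""

# read_csv accepts any file-like object, so StringIO can stand in for a file path
states_from_text = pd.read_csv(io.StringIO(csv_text), index_col='name')

# Build the same table from two Series to compare the two construction routes
population = pd.Series([38332521, 26448193, 19651127, 19552860, 12882135],
                       index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
states_from_series = pd.DataFrame({'population': population, 'area': area})

# Compare the underlying values, then look at the index names
same_values = (states_from_text.to_numpy() == states_from_series.to_numpy()).all()
print(same_values)
print(states_from_text.index.name, states_from_series.index.name)
```

The index name is the one visible difference between the two routes: `index_col='name'` carries the column header over as the index name, which is why the CSV output above has an extra `name` line.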



## Data Indexing and Selection

DataFrames make it easy to extract specific data as a smaller DataFrame, a new series, or an individual value.

In [64]:
# Get a series from a DataFrame
area = states_df['area']
print("Series from DataFrame:")
print(area)
print(type(area))
print()

# Get a specific value from a DataFrame
print("Specific value from DataFrame:")
print(f"Area of California = {states_df.loc['California', 'area']}")
print()

# Get a slice of rows by index label (with .loc, both endpoints are included)
print("Get subset of dataframe, Texas through Florida:")
print(states_df.loc['Texas':'Florida'])
print(type(states_df.loc['Texas':'Florida']))
print()

# Filter rows with a boolean mask built from a numeric column
print("Get subset of dataframe, area > 200000:")
print(states_df[states_df['area'] > 200000])

Series from DataFrame:
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
<class 'pandas.core.series.Series'>

Specific value from DataFrame:
Area of California = 423967

Get subset of dataframe, Texas through Florida:
          population    area
Texas       26448193  695662
New York    19651127  141297
Florida     19552860  170312
<class 'pandas.core.frame.DataFrame'>

Get subset of dataframe, area > 200000:
            population    area
California    38332521  423967
Texas         26448193  695662
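
The selections above can be extended in a few common ways. The sketch below rebuilds the same five-state table under a new, illustrative name (`states`) so that it runs on its own. It contrasts label-based slicing with `.loc` against position-based slicing with `.iloc`, and shows how to combine two conditions in one mask.

```python
import pandas as pd

# Same data as the lab's states_df, rebuilt here so the example is self-contained
states = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127, 19552860, 12882135],
     'area': [423967, 695662, 141297, 170312, 149995]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])

# Label-based slice: both endpoints are included
by_label = states.loc['Texas':'Florida']

# Position-based slice: the stop position is excluded, so 1:4 covers rows 1, 2, 3
by_position = states.iloc[1:4]

# Combine boolean masks with & (and) or | (or); each condition needs its own parentheses
big_and_populous = states[(states['area'] > 200000) & (states['population'] > 30000000)]

# .loc can filter rows and pick a column in one step, returning a Series
small_state_populations = states.loc[states['area'] < 150000, 'population']

print(by_label.equals(by_position))
print(big_and_populous)
print(small_state_populations)
```

Using `.loc` and `.iloc` explicitly makes it clear whether a slice refers to labels or to positions, which matters most when the index itself contains integers.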


## Modifying DataFrames
When data is stored in a DataFrame, you can add rows or columns and even create calculated columns from the data in existing columns.

In [66]:
# Add a row (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
new_row = pd.DataFrame({'population': 3949000, 'area': 181038}, index=['Oklahoma'])
states_df = pd.concat([states_df, new_row])
print("Add a row to DataFrame:")
print(states_df)
print()

# Add a series as a column
density = states_df['population']/states_df['area']
states_df['density'] = density
print("Add calculated column to DataFrame:")
print(states_df)


Add a row to DataFrame:
            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
Oklahoma       3949000  181038

Add calculated column to DataFrame:
            population    area     density
California    38332521  423967   90.413926
Texas         26448193  695662   38.018740
New York      19651127  141297  139.076746
Florida       19552860  170312  114.806121
Illinois      12882135  149995   85.883763
Oklahoma       3949000  181038   21.813100
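
One thing to watch for in a notebook: a cell that adds a row will add it again every time the cell is re-run, leaving duplicate index labels. The sketch below shows two ways to guard against that, using a small two-state table and illustrative names (`states`, `new_row`, `with_density`) that are not part of the lab. It uses `pd.concat`, which replaces the `DataFrame.append` method removed in pandas 2.0.

```python
import pandas as pd

# Small stand-in table so the example runs on its own
states = pd.DataFrame(
    {'population': [38332521, 26448193],
     'area': [423967, 695662]},
    index=['California', 'Texas'])

new_row = pd.DataFrame({'population': 3949000, 'area': 181038}, index=['Oklahoma'])

# pd.concat returns a new DataFrame; verify_integrity=True raises ValueError
# if the result would contain a duplicate index label
states = pd.concat([states, new_row], verify_integrity=True)

try:
    # Simulates re-running the same cell: 'Oklahoma' is already present
    states = pd.concat([states, new_row], verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

# Assigning through .loc adds the row if the label is new and overwrites it
# otherwise, so repeated runs still leave a single 'Oklahoma' row
states.loc['Oklahoma'] = [3949000, 181038]

# assign() returns a copy with the calculated column and leaves `states` unchanged
with_density = states.assign(density=states['population'] / states['area'])

print(duplicate_rejected)
print(with_density)
```

Which guard to use is a design choice: `verify_integrity=True` fails loudly so the mistake is noticed, while the `.loc` assignment silently keeps the table consistent.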
