# What is Pandas?
<a id="pandas"> </a>

The `pandas` package is one of the most popular Python tools for data management and manipulation. `pandas` is built *on top* of `numpy`. Thus, much of the functionality and methods that are available in `numpy` are also available in `pandas`. 
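As a minimal sketch of that relationship, the snippet below uses made-up values (not from any dataset in this notebook) to show that a Series can be converted to a NumPy array and passed directly to a NumPy function.

```python
import numpy as np
import pandas as pd

# Illustrative values only
temperatures = pd.Series([20.5, 22.0, 19.5])

# The underlying data can be exposed as a NumPy array
as_array = temperatures.to_numpy()
print(type(as_array))

# NumPy functions accept a Series and hand back a Series
roots = np.sqrt(temperatures)
print(type(roots))
```

The index is preserved when a NumPy function is applied to a Series, which is one reason to stay in pandas rather than converting to an array first.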

## Getting started
The `pandas` package is included with Anaconda, but it can also be installed using either `conda` or `pip`.
```bash
# Use default channel
conda install pandas

# Specify the conda-forge channel
conda install -c conda-forge pandas

# Use pip
pip install pandas

```

### Set max rows/columns

To see more than the default number of rows and columns, set the display options.

In [57]:
import pandas as pd

pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)
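The settings above stay in effect for the rest of the session. As a small sketch, the snippet below reads a setting back with `pd.get_option`, overrides it temporarily with `pd.option_context`, and then restores the default with `pd.reset_option`. The value 10 is an arbitrary choice for illustration.

```python
import pandas as pd

pd.set_option('display.max_rows', 85)
current_max_rows = pd.get_option('display.max_rows')
print(current_max_rows)

# Override the option only inside the with-block
with pd.option_context('display.max_rows', 10):
    inside = pd.get_option('display.max_rows')

# The earlier value is back once the block exits
after = pd.get_option('display.max_rows')

# Return the option to the pandas default
pd.reset_option('display.max_rows')
```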

# Series and DataFrames
A pandas Series object is a one-dimensional labeled array that can hold any data type. It is one of two fundamental data structures provided by the pandas library. The other data structure is the DataFrame, which we'll examine next. Isolating a single column from a DataFrame results in a Series object.

A Series consists of two main components: the index and the data. The index provides labels for each element in the Series, allowing for easy and efficient data access and alignment. The data component contains the actual values.

You can create a Series using various data sources, such as lists, arrays, dictionaries, DataFrames, or even other Series objects. Here's an example of creating a Series from a list:

In [58]:
import pandas as pd

In [59]:
grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
grade_series = pd.Series(grades)
grade_series

0     88.0
1     67.0
2    100.0
3     92.0
4      NaN
5     95.0
6     82.0
7    100.0
8    100.0
9     95.0
dtype: float64

Note that a default index is added to the grades to create the series object.

Alternatively, you can specify an index. In this case, the student ID is provided as the index.

In [60]:
grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
students = ['dmac', 'edev', 'joeb', 'tdog', 'txroy', 'sthicks', 'jfrerk', 'spickard', 'choenes', 'jsisson']
student_grades_series = pd.Series(grades, students)
student_grades_series

dmac         88.0
edev         67.0
joeb        100.0
tdog         92.0
txroy         NaN
sthicks      95.0
jfrerk       82.0
spickard    100.0
choenes     100.0
jsisson      95.0
dtype: float64
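A Series can also be built from a dictionary, in which case the keys become the index. The sketch below reuses three of the student IDs and grades from above and shows the two components of a Series (index and data) along with label-based and position-based access.

```python
import pandas as pd

# Keys become the index labels, values become the data
grade_dict = {'dmac': 88, 'edev': 67, 'joeb': 100}
series_from_dict = pd.Series(grade_dict)

print(series_from_dict.index.tolist())   # the labels
print(series_from_dict.to_numpy())       # the data

# Access one element by label and by integer position
by_label = series_from_dict['edev']
by_position = series_from_dict.iloc[1]
```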

Understanding what type of object you're working with is important in any programming language. Different objects (classes) have different methods and attributes. A pandas Series object has different methods and attributes than a pandas DataFrame. Below is a partial listing of the methods available with Series objects.

## Series functions

### head()/tail()
View the first few or last few items in a Series using ```head()``` and ```tail()```. Both return five items by default; pass a number to change that.

In [61]:
grade_series.head(10)

0     88.0
1     67.0
2    100.0
3     92.0
4      NaN
5     95.0
6     82.0
7    100.0
8    100.0
9     95.0
dtype: float64

In [62]:
grade_series.tail()

5     95.0
6     82.0
7    100.0
8    100.0
9     95.0
dtype: float64

### Math functions
* describe() - Display descriptive statistics of your data using the ```describe()``` function.
* sum()
* min()
* max()
* mean()
* median()
* std()

In [63]:
grade_series.describe()

count      9.000000
mean      91.000000
std       10.851267
min       67.000000
25%       88.000000
50%       95.000000
75%      100.000000
max      100.000000
dtype: float64

In [64]:
grade_series.min()

67.0

In [65]:
grade_series.median()

95.0
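The count of 9 in the `describe()` output reflects that these functions skip missing values by default. The sketch below rebuilds the same grade Series and contrasts the default behavior with `skipna=False`.

```python
import pandas as pd

grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
grade_series = pd.Series(grades)

# NaN is skipped by default
total = grade_series.sum()
average = grade_series.mean()
count = grade_series.count()

# With skipna=False the missing value propagates into the result
strict_mean = grade_series.mean(skipna=False)
print(total, average, count, strict_mean)
```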

### Data Manipulation Functions
* isnull() - checks for missing values (null/NaN)
* unique() - returns an array of unique values
* value_counts() - returns the frequencies of unique values
* apply(function) - applies a function to each element
* dropna() - returns a new series with missing values removed

### isnull()
Use ```isnull()``` to check for null values. A True/False series is returned, which corresponds to each item in the series. True indicates the value is null (NaN). NaN means "Not a Number."

In [66]:
grade_series.isnull()

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8    False
9    False
dtype: bool

### unique() - Find unique values

In [67]:
print(grade_series.unique())

[ 88.  67. 100.  92.  nan  95.  82.]
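The `value_counts()` function listed earlier pairs naturally with `unique()`. As a short sketch on the same grades, it returns how often each value occurs, most frequent first, and leaves out missing values unless `dropna=False` is passed.

```python
import pandas as pd

grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
grade_series = pd.Series(grades)

# Frequencies of each distinct grade, sorted from most to least common
counts = grade_series.value_counts()
print(counts)

# Include the missing value as its own entry
counts_with_nan = grade_series.value_counts(dropna=False)
```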


### apply() function
Use the apply function to modify every item in a series using a standard or custom function. In this example, we use a custom function to create a series containing the letter grade.

In [68]:
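def number_to_letter_grade(score):
    """Convert a numeric score to a letter grade (None if below 60 or missing)."""
    # Comparisons with NaN are always False, so a missing score falls through to None
    if score > 89:
        return "A"
    elif score > 79:
        return "B"
    elif score > 69:
        return "C"
    elif score > 59:
        return "D"
    else:
        return None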

In [69]:
number_to_letter_grade(62)

'D'

Pass the function to ```apply()``` to convert every grade in the Series:

In [70]:
letter_series = grade_series.apply(number_to_letter_grade)

In [71]:
letter_series

0       B
1       D
2       A
3       A
4    None
5       A
6       B
7       A
8       A
9       A
dtype: object
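For simple range-based bucketing like this, `pd.cut` is an alternative to `apply()` with a custom function. This is a different technique from the one used above, shown only as a sketch: the bin edges are chosen here to mirror the thresholds in `number_to_letter_grade`, and the result is a categorical Series with NaN (rather than None) for missing or failing scores.

```python
import numpy as np
import pandas as pd

grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
grade_series = pd.Series(grades)

# Bins are right-inclusive by default: (59, 69] -> D, (69, 79] -> C,
# (79, 89] -> B, and everything above 89 -> A
bins = [59, 69, 79, 89, np.inf]
labels = ['D', 'C', 'B', 'A']
letter_cut = pd.cut(grade_series, bins=bins, labels=labels)
print(letter_cut)
```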

### dropna()
Use ```dropna()``` to return a new Series with the missing values removed.

In [72]:
grades_no_missing = grade_series.dropna()
grades_no_missing

0     88.0
1     67.0
2    100.0
3     92.0
5     95.0
6     82.0
7    100.0
8    100.0
9     95.0
dtype: float64

Notice that the None/null/NaN item has been removed.

# Indexes

# Locating and Filtering data

In [73]:
import pandas as pd
df_people = pd.read_csv('files/people_data.csv')
print(df_people.shape)
df_people.head()

(200, 12)


| | First Name | Last Name | Gender | Age | Email | Phone | Education | Occupation | Experience (Years) | Salary | Marital Status | Number of Children |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | amy | moore | Female | 26 | a.moore@randatmail.com | 177-8697-63 | Bachelor | Astronomer | 11 | 118590 | Married | 4 |
| 1 | rosie | henderson | Female | 29 | r.henderson@randatmail.com | 747-7768-48 | Primary | Manager | 14 | 42540 | Single | 1 |
| 2 | garry | cooper | Male | 29 | g.cooper@randatmail.com | 131-0615-33 | Upper secondary | Agronamist | 11 | 149123 | Single | 3 |
| 3 | sarah | miller | Female | 27 | s.miller@randatmail.com | 811-2617-15 | Primary | Pharmacist | 6 | 97946 | Single | 2 |
| 4 | rubie | sullivan | Female | 23 | r.sullivan@randatmail.com | 543-4162-06 | Bachelor | Engineer | 8 | 78613 | Married | 4 |


## Loading data and specifying an index
If the data you are loading has a column that should be used as an index, you can specify that column with the ```index_col``` argument. Although there are better attributes to use for an index, we could index individuals by their email address using the following statement. Note that the Email column appears at the far left and in bold, indicating that it is the index.

In [74]:
df_people = pd.read_csv('files/people_data.csv', index_col = 'Email')
df_people.head()

| Email | First Name | Last Name | Gender | Age | Phone | Education | Occupation | Experience (Years) | Salary | Marital Status | Number of Children |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| a.moore@randatmail.com | amy | moore | Female | 26 | 177-8697-63 | Bachelor | Astronomer | 11 | 118590 | Married | 4 |
| r.henderson@randatmail.com | rosie | henderson | Female | 29 | 747-7768-48 | Primary | Manager | 14 | 42540 | Single | 1 |
| g.cooper@randatmail.com | garry | cooper | Male | 29 | 131-0615-33 | Upper secondary | Agronamist | 11 | 149123 | Single | 3 |
| s.miller@randatmail.com | sarah | miller | Female | 27 | 811-2617-15 | Primary | Pharmacist | 6 | 97946 | Single | 2 |
| r.sullivan@randatmail.com | rubie | sullivan | Female | 23 | 543-4162-06 | Bachelor | Engineer | 8 | 78613 | Married | 4 |


In [75]:
df_people.iloc[:5,[0,1,3]]

| Email | First Name | Last Name | Age |
| --- | --- | --- | --- |
| a.moore@randatmail.com | amy | moore | 26 |
| r.henderson@randatmail.com | rosie | henderson | 29 |
| g.cooper@randatmail.com | garry | cooper | 29 |
| s.miller@randatmail.com | sarah | miller | 27 |
| r.sullivan@randatmail.com | rubie | sullivan | 23 |


The `iloc` indexer in pandas selects data from a DataFrame based on integer positions. It is used with square brackets rather than called like a function, and it allows you to specify row and column positions to access specific data points or subsets of the DataFrame.

The general syntax of iloc is:

```Python
df.iloc[row_index(s), column_index(s)]
```

In [76]:
# Return the row at position 1 (the second row). A single row is one-dimensional, so it is returned as a Series.
df_people.iloc[1]

First Name                  rosie
Last Name               henderson
Gender                     Female
Age                            29
Phone                 747-7768-48
Education                 Primary
Occupation                Manager
Experience (Years)             14
Salary                      42540
Marital Status             Single
Number of Children              1
Name: r.henderson@randatmail.com, dtype: object

In [77]:
# Return the first 8 rows and first 2 columns (the 0 can be omitted; the end of a slice is excluded)
df_people.iloc[0:8,0:2]

| Email | First Name | Last Name |
| --- | --- | --- |
| a.moore@randatmail.com | amy | moore |
| r.henderson@randatmail.com | rosie | henderson |
| g.cooper@randatmail.com | garry | cooper |
| s.miller@randatmail.com | sarah | miller |
| r.sullivan@randatmail.com | rubie | sullivan |
| f.williams@randatmail.com | fiona | williams |
| t.carter@randatmail.com | thomas | carter |
| s.martin@randatmail.com | sawyer | martin |


In [78]:
# Return the rows at positions 1, 3, and 5 and show only the first name, last name, and marital status (column position 9)
df_people.iloc[[1,3,5],[0,1,9]]

| Email | First Name | Last Name | Marital Status |
| --- | --- | --- | --- |
| r.henderson@randatmail.com | rosie | henderson | Single |
| s.miller@randatmail.com | sarah | miller | Single |
| f.williams@randatmail.com | fiona | williams | Single |
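The `iloc` patterns above can be tried without the CSV file. The sketch below builds a small stand-in DataFrame (the `example.com` addresses are made up for illustration) and exercises a single position, a slice, lists of positions, and a negative position.

```python
import pandas as pd

# Small hypothetical frame indexed by made-up email addresses
df = pd.DataFrame(
    {
        'First Name': ['amy', 'rosie', 'garry', 'sarah'],
        'Last Name': ['moore', 'henderson', 'cooper', 'miller'],
        'Age': [26, 29, 29, 27],
    },
    index=['a.moore@example.com', 'r.henderson@example.com',
           'g.cooper@example.com', 's.miller@example.com'],
)

second_row = df.iloc[1]            # a single position returns a Series
first_two = df.iloc[:2, :2]        # the end of a slice is excluded
picked = df.iloc[[0, 3], [0, 2]]   # lists of row and column positions
last_row = df.iloc[-1]             # negative positions count from the end
```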


## Using loc
The ```loc``` indexer in pandas selects data from a DataFrame based on labels. Like ```iloc```, it is used with square brackets rather than parentheses. It allows you to specify row and column labels to access specific data points or subsets of the DataFrame.

The general syntax of loc is:

```Python
df.loc[row_label(s), column_label(s)]
```

In [96]:
df_people.loc['r.sullivan@randatmail.com', ['First Name', 'Last Name', 'Phone']]

First Name          rubie
Last Name        sullivan
Phone         543-4162-06
Name: r.sullivan@randatmail.com, dtype: object

In [98]:
df_people.loc['r.sullivan@randatmail.com':'e.robinson@randatmail.com', ['First Name', 'Last Name', 'Phone']]

| Email | First Name | Last Name | Phone |
| --- | --- | --- | --- |
| r.sullivan@randatmail.com | rubie | sullivan | 543-4162-06 |
| f.williams@randatmail.com | fiona | williams | 807-4311-40 |
| t.carter@randatmail.com | thomas | carter | 281-1436-40 |
| s.martin@randatmail.com | sawyer | martin | 905-3877-91 |
| e.robinson@randatmail.com | eleanor | robinson | 049-5493-56 |
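One difference worth noticing in the output above is that a label slice with `loc` includes both endpoints, whereas a position slice with `iloc` excludes the end. The sketch below shows this on a tiny frame with hypothetical single-letter labels.

```python
import pandas as pd

df = pd.DataFrame(
    {'First Name': ['amy', 'rosie', 'garry', 'sarah'],
     'Age': [26, 29, 29, 27]},
    index=['a', 'b', 'c', 'd'],   # hypothetical short labels
)

by_label = df.loc['b':'c', ['First Name']]   # includes both 'b' and 'c'
by_position = df.iloc[1:2, [0]]              # excludes position 2
single_value = df.loc['d', 'Age']            # one row label, one column label
```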


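To filter rows by a condition, build a Boolean Series and pass it to ```loc```. Only the rows where the Series is True are returned.

In [79]: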
married_filter = (df_people['Marital Status'] == "Married")

In [90]:
df_email_index = df_people.loc[married_filter]
df_email_index.head()

| Email | First Name | Last Name | Gender | Age | Phone | Education | Occupation | Experience (Years) | Salary | Marital Status | Number of Children |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| a.moore@randatmail.com | amy | moore | Female | 26 | 177-8697-63 | Bachelor | Astronomer | 11 | 118590 | Married | 4 |
| r.sullivan@randatmail.com | rubie | sullivan | Female | 23 | 543-4162-06 | Bachelor | Engineer | 8 | 78613 | Married | 4 |
| s.martin@randatmail.com | sawyer | martin | Male | 20 | 905-3877-91 | Upper secondary | Lawer | 7 | 140405 | Married | 0 |
| e.robinson@randatmail.com | eleanor | robinson | Female | 20 | 049-5493-56 | Primary | Scientist | 6 | 194147 | Married | 3 |
| a.kelley@randatmail.com | adrianna | kelley | Female | 22 | 251-3368-86 | Upper secondary | Actor | 0 | 160569 | Married | 1 |


## Using logic and multiple attributes to locate data
If we wanted to find Managers and Engineers who are married, we would use the following filter. Note the use of '&' for 'and' and the vertical bar '|' for 'or'. Each comparison is wrapped in parentheses because '&' and '|' bind more tightly than '=='.

In [81]:
multi_filter = (df_people['Marital Status'] == "Married") & ((df_people['Occupation']=='Manager') | (df_people['Occupation']=='Engineer'))

In [82]:
df_people.loc[multi_filter]

| Email | First Name | Last Name | Gender | Age | Phone | Education | Occupation | Experience (Years) | Salary | Marital Status | Number of Children |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| r.sullivan@randatmail.com | rubie | sullivan | Female | 23 | 543-4162-06 | Bachelor | Engineer | 8 | 78613 | Married | 4 |
| a.cameron@randatmail.com | alisa | cameron | Female | 30 | 178-7304-99 | Lower secondary | Engineer | 7 | 193299 | Married | 3 |
| m.morrison@randatmail.com | mike | morrison | Male | 30 | 999-4285-82 | Doctoral | Manager | 8 | 94263 | Married | 2 |
| v.foster@randatmail.com | violet | foster | Female | 18 | 923-6053-60 | Master | Manager | 2 | 180192 | Married | 4 |
| r.edwards@randatmail.com | robert | edwards | Male | 21 | 913-0408-26 | Upper secondary | Engineer | 8 | 173872 | Married | 2 |
| a.richards@randatmail.com | arthur | richards | Male | 28 | 292-0356-03 | Upper secondary | Manager | 4 | 99047 | Married | 1 |
| e.russell@randatmail.com | eleanor | russell | Female | 29 | 383-7318-08 | Primary | Manager | 11 | 161017 | Married | 2 |
| j.west@randatmail.com | james | west | Male | 19 | 498-5620-38 | Primary | Engineer | 1 | 43406 | Married | 5 |
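When several values of the same column are combined with 'or', `isin()` can express the same filter more compactly. The sketch below uses a small made-up frame (not the people data) and checks that both forms select the same rows.

```python
import pandas as pd

# Hypothetical rows for illustration
df = pd.DataFrame({
    'Occupation': ['Manager', 'Engineer', 'Astronomer', 'Manager'],
    'Marital Status': ['Married', 'Single', 'Married', 'Married'],
})

# Explicit form using & and |
explicit = (df['Marital Status'] == 'Married') & (
    (df['Occupation'] == 'Manager') | (df['Occupation'] == 'Engineer')
)

# Equivalent form using isin() for the list of occupations
married = df['Marital Status'] == 'Married'
with_isin = married & df['Occupation'].isin(['Manager', 'Engineer'])

result = df.loc[with_isin]
print(result)
```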


# Modifying data

## unique() and nunique()

The ```unique()``` function returns an array containing the unique values present in a Series or column of a DataFrame. It eliminates duplicates and returns the values in order of first appearance; the result is not sorted.

The ```nunique()``` function returns the number of distinct values in a Series or column of a DataFrame. By default, missing values (NaN) are not counted.

In [89]:
print(df_people['Education'].unique())
print(df_people['Education'].nunique())

['Bachelor' 'Primary' 'Upper secondary' 'Doctoral' 'Lower secondary'
 'Master']
6
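As a brief sketch of how these two functions treat ordering and missing values, the snippet below uses a hypothetical Series with one missing entry. Sorting, when needed, is done explicitly after dropping the missing value.

```python
import pandas as pd

# Hypothetical values, including one missing entry
education = pd.Series(['Primary', 'Bachelor', 'Primary', None, 'Master'])

# Order of first appearance; the missing value is kept as its own entry
unique_values = education.unique()

# Missing values are excluded from the count unless dropna=False
distinct_count = education.nunique()
distinct_with_nan = education.nunique(dropna=False)

# Sort explicitly if an ordered list is needed
sorted_unique = sorted(education.dropna().unique())
```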


# Add / Remove Rows and Columns

# Grouping and Aggregating

In [85]:
df_people.describe()

| | Age | Experience (Years) | Salary | Number of Children |
| --- | --- | --- | --- | --- |
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 23.85 | 6.07 | 115400.58 | 2.63 |
| std | 3.87201 | 3.71539 | 48731.819296 | 1.654323 |
| min | 18.0 | 0.0 | 31721.0 | 0.0 |
| 25% | 20.0 | 3.0 | 70128.5 | 1.0 |
| 50% | 24.5 | 6.0 | 116373.5 | 3.0 |
| 75% | 27.0 | 9.0 | 155637.25 | 4.0 |
| max | 30.0 | 14.0 | 199056.0 | 5.0 |
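While `describe()` summarizes the whole DataFrame, `groupby()` computes the same kinds of statistics per group. The following is only a minimal sketch on made-up salary figures, not results from the people data.

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({
    'Education': ['Bachelor', 'Primary', 'Bachelor', 'Primary'],
    'Salary': [100000, 40000, 80000, 60000],
})

# One statistic per group
mean_salary = df.groupby('Education')['Salary'].mean()

# Several statistics per group at once
summary = df.groupby('Education')['Salary'].agg(['min', 'max', 'count'])
print(summary)
```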


# Handling Missing Values
Missing values are a common problem when performing data analysis. This section discusses ways to handle missing values.

In [None]:
import random
import numpy as np

import pandas as pd
df_people = pd.read_csv('files/people_data.csv')

First, to make it easier, we'll create a small subset of the people DataFrame. Note the use of ```copy()```, which creates a *deep copy* of the selected rows by default. Working on an independent copy avoids modifying a slice of the original DataFrame, which can trigger a ```SettingWithCopyWarning``` and produce unpredictable results.

In [None]:
small_df = df_people.iloc[:10,:].copy()
small_df.head()
small_df.shape
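To illustrate why the copy matters, the sketch below changes a value in a copied subset and confirms the original is untouched. The three rows reuse ages and salaries shown earlier; the replacement value 99 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'Age': [26, 29, 29], 'Salary': [118590, 42540, 149123]})

# An independent copy of the first two rows
subset = df.iloc[:2, :].copy()
subset.loc[0, 'Age'] = 99

# The original DataFrame still holds the old value
original_age = df.loc[0, 'Age']
print(original_age, subset.loc[0, 'Age'])
```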

Next, we'll randomly add missing values using ```np.nan``` and then add one column and one row that contain only NaN values.

In [None]:
# Introduce missing (NaN) values into the small DataFrame

num_missing = 5

for x in range(num_missing):
    # randint is inclusive on both ends: rows 0-9, columns 0-9
    random_row = random.randint(0, 9)
    random_col = random.randint(0, 9)
    small_df.iloc[random_row, random_col] = np.nan

# Add a column of NaN values
small_df['SS#'] = np.nan

# Add a row of NaN values
small_df.loc[len(small_df.index)] = np.nan

Rows or columns with missing values can be removed with ```dropna```. Its syntax is as follows:

```Python
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
```

- **axis**: drop rows (```axis=0```) or columns (```axis=1```) that contain missing values
- **how**: ```'any'``` drops a row or column that contains at least one missing value; ```'all'``` drops it only when every value is missing
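The parameters above can be sketched on a small hypothetical DataFrame in which column `c` is entirely missing and the last row is entirely missing. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, 2.0, np.nan, np.nan],
    'c': [np.nan, np.nan, np.nan, np.nan],
})

# Drop columns in which every value is missing (removes 'c')
drop_all_nan_cols = df.dropna(axis=1, how='all')

# Drop rows in which every value is missing (removes row 3)
drop_all_nan_rows = df.dropna(axis=0, how='all')

# Default (axis=0, how='any'): every row has a NaN in 'c', so nothing is left
drop_any = df.dropna()

# Only consider column 'a' when deciding which rows to drop
subset_a = df.dropna(subset=['a'])
```

Dropping the all-NaN column first and then applying the default row-wise drop is one way to avoid losing every row.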

### Sum the count of missing values by column

In [None]:
small_df.isna().sum()
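Because `small_df` is filled in randomly, its counts differ from run to run. The sketch below uses a fixed, made-up DataFrame so the counts per column, per row, and in total are reproducible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
})

# isna() yields True/False; summing counts the True values
missing_per_column = df.isna().sum()
missing_per_row = df.isna().sum(axis=1)
total_missing = df.isna().sum().sum()
print(missing_per_column)
```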

# Casting Data Types
After creating a pandas DataFrame, you may have columns that were not assigned the data type you want. For example, after using the ```read_csv()``` function to create ```df_people```, Salary is stored as an integer instead of a float.

In [None]:
df_people.dtypes

In [None]:
df_people['Salary'] = df_people['Salary'].astype(float)
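The same conversion can be checked on a small stand-in DataFrame. In this sketch the Age column is deliberately stored as strings (a hypothetical situation, not how the people data is loaded) to show that `astype()` also converts digit strings to integers.

```python
import pandas as pd

# Salary starts as integers; Age is stored as strings for illustration
df = pd.DataFrame({'Salary': [118590, 42540], 'Age': ['26', '29']})

before = df['Salary'].dtype

# astype() returns a new Series, so assign the result back to the column
df['Salary'] = df['Salary'].astype(float)
df['Age'] = df['Age'].astype(int)

print(before, df.dtypes.to_dict())
```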

# Working with Time Series Data

# Interacting with Excel, JSON, Parquet files, SQL

## Using iloc

# Using Polars with Pandas


# Initialize everything

This section is provided to quickly reload the Series and DataFrame objects used in this notebook.

In [None]:
import pandas as pd
import numpy as np

grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
grade_series = pd.Series(grades)

grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
students = ['dmac', 'edev', 'joeb', 'tdog', 'txroy', 'sthicks', 'jfrerk', 'spickard', 'choenes', 'jsisson']
student_grades_series = pd.Series(grades, students)

df_people = pd.read_csv('files/people_data.csv')
small_df = df_people.iloc[:10,:].copy()

In [None]:
df_nfl = pd.read_csv('files/NFLPlaybyPlay2015.csv', low_memory = False)

In [None]:
df_nfl.info()

In [None]:
df_nfl['InterceptionThrown'].unique()

In [None]:
df_nfl.isna().sum().sum()

In [None]:
df_nfl['Date2'] = pd.to_datetime(df_nfl['Date'])

In [None]:
df_nfl['Date2'].nunique()