# What is Pandas?
<a id="pandas"> </a>

The `pandas` package is one of the most popular Python tools for data management and manipulation. `pandas` is built *on top* of `numpy`, so many of the functions and methods available in `numpy` are also available in `pandas`.

## Getting started
The `pandas` package is included with Anaconda, but it can also be installed using either `conda` or `pip`.
```Python
# Use default channel
conda install pandas

# Specify the conda-forge channel
conda install -c conda-forge pandas

# Use pip
pip install pandas

```

# Series and DataFrames
A pandas Series object is a one-dimensional labeled array that can hold any data type. It is one of two fundamental data structures provided by the pandas library. The other data structure is the DataFrame, which we'll examine next. Isolating a single column from a DataFrame results in a Series object.

A Series consists of two main components: the index and the data. The index provides labels for each element in the Series, allowing for easy and efficient data access and alignment. The data component contains the actual values.
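
The two components can be inspected directly. The snippet below is a minimal sketch; the scores and the labels `"a"`, `"b"`, `"c"` are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical data: three scores labeled with made-up keys
scores = pd.Series([88, 67, 100], index=["a", "b", "c"])

labels = list(scores.index)   # the index component (the labels)
values = scores.to_numpy()    # the data component as a NumPy array

print(labels)
print(values)
print(scores["b"])            # label-based access to a single element
```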

You can create a Series using various data sources, such as lists, arrays, dictionaries, DataFrames, or even other Series objects. Here's an example of creating a Series from a list:

In [None]:
import pandas as pd

In [None]:
grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
grade_series = pd.Series(grades)
grade_series

Note that a default index is added to the grades to create the series object.

Alternatively, you can specify an index. In this case, the student ID is provided as the index.

In [7]:
grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
students = ['dmac', 'edev', 'joeb', 'tdog', 'txroy', 'sthicks', 'jfrerk', 'spickard', 'choenes', 'jsisson']
student_grades_series = pd.Series(grades, students)
student_grades_series

dmac         88.0
edev         67.0
joeb        100.0
tdog         92.0
txroy         NaN
sthicks      95.0
jfrerk       82.0
spickard    100.0
choenes     100.0
jsisson      95.0
dtype: float64
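
A list is only one of the data sources mentioned earlier. As a rough sketch, a dictionary can also be passed to `pd.Series`, in which case the keys become the index. The student IDs `s1` to `s3` here are hypothetical.

```python
import pandas as pd

# Hypothetical student IDs mapped to grades
grades_by_student = {"s1": 88, "s2": 67, "s3": 100}

# The dictionary keys become the index labels; the values become the data
dict_series = pd.Series(grades_by_student)

print(dict_series)
print(dict_series["s2"])
```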

Understanding what type of object you're working with is important in any programming language. Different objects (classes) have different methods and attributes. A pandas Series object has different methods and attributes than a pandas DataFrame. Below is a partial listing of the methods available with Series objects.
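
One way to check which kind of object you have is `type()` or `isinstance()`. The sketch below uses a small made-up DataFrame (the column names and values are assumptions) to show that single brackets return a Series while a list of column names returns a DataFrame.

```python
import pandas as pd

# Hypothetical two-column DataFrame for illustration
df = pd.DataFrame({"grade": [88, 67], "student": ["s1", "s2"]})

column = df["grade"]        # single brackets: one column as a Series
sub_frame = df[["grade"]]   # list of column names: a one-column DataFrame

print(type(column).__name__)
print(type(sub_frame).__name__)
```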

## Series functions

### head()/tail()
View the first few or last few items in a Series using head/tail.

In [8]:
grade_series.head(10)

0     88.0
1     67.0
2    100.0
3     92.0
4      NaN
5     95.0
6     82.0
7    100.0
8    100.0
9     95.0
dtype: float64

In [9]:
grade_series.tail()

5     95.0
6     82.0
7    100.0
8    100.0
9     95.0
dtype: float64

### Math functions
* describe() - Display descriptive statistics of your data using the ```describe()``` function.
* sum()
* min()
* max()
* mean()
* median()
* std()

In [10]:
grade_series.describe()

count      9.000000
mean      91.000000
std       10.851267
min       67.000000
25%       88.000000
50%       95.000000
75%      100.000000
max      100.000000
dtype: float64

In [11]:
grade_series.min()

67.0

In [12]:
grade_series.median()

95.0

### Data Manipulation Functions
* isnull() - checks for missing values (null/NaN)
* unique() - returns an array of unique values
* value_counts() - returns the frequencies of unique values
* apply(function) - applies a function to each element
* dropna() - returns a new series with missing values removed
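
As a brief sketch of `value_counts()`, the snippet below rebuilds the same grades list used earlier. By default missing values are left out of the counts; passing `dropna=False` includes them.

```python
import pandas as pd

grades = pd.Series([88, 67, 100, 92, None, 95, 82, 100, 100, 95])

# Frequencies of each unique value, most frequent first; NaN is excluded by default
counts = grades.value_counts()

# Include the missing value as its own category
counts_with_nan = grades.value_counts(dropna=False)

print(counts)
print(counts_with_nan)
```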

### isnull()
Use ```isnull()``` to check for null values. A True/False series is returned, which corresponds to each item in the series. True indicates the value is null (NaN). NaN means "Not a Number."

In [15]:
grade_series.isnull()

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8    False
9    False
dtype: bool

### unique() - Find unique values

In [16]:
print(grade_series.unique())

[ 88.  67. 100.  92.  nan  95.  82.]


### apply() function
Use the apply function to modify every item in a series using a standard or custom function. In this example, we use a custom function to create a series containing the letter grade.

In [17]:
def number_to_letter_grade(score):

    if score > 89:
        return "A"
    elif score > 79:
        return "B"
    elif score > 69:
        return "C"
    elif score > 59:
        return "D"
    else:
        return None
    

In [18]:
number_to_letter_grade(62)

'D'

apply() - Modifying values using a function

In [25]:
letter_series = grade_series.apply(number_to_letter_grade)

In [26]:
letter_series

0       B
1       D
2       A
3       A
4    None
5       A
6       B
7       A
8       A
9       A
dtype: object

In [28]:
grades_no_missing = grade_series.dropna()
grades_no_missing

0     88.0
1     67.0
2    100.0
3     92.0
5     95.0
6     82.0
7    100.0
8    100.0
9     95.0
dtype: float64

Notice that the None/null/NaN item has been removed.

In [31]:
grades_no_missing = grade_series.dropna().reset_index(drop=True)
grades_no_missing

0     88.0
1     67.0
2    100.0
3     92.0
4     95.0
5     82.0
6    100.0
7    100.0
8     95.0
dtype: float64

# Indexes

# Locating and Filtering data

In [32]:
import pandas as pd
df_people = pd.read_csv('files/people_data.csv')
df_people.head()
df_people.shape

(200, 12)

In [34]:
df_people.iloc[:5,[0,1,3]]

Unnamed: 0,First Name,Last Name,Age
0,amy,moore,26
1,rosie,henderson,29
2,garry,cooper,29
3,sarah,miller,27
4,rubie,sullivan,23


The `iloc` indexer in pandas selects data from a DataFrame by integer position. It is used with square brackets rather than called like a function, and it lets you specify row and column positions to access specific data points or subsets of the DataFrame.

The general syntax of iloc is:

```Python
df.iloc[row_index(s), column_index(s)]
```

In [35]:
# Return the second row (integer position 1). Since it is one-dimensional, it is returned as a Series.
df_people.iloc[1]

First Name                                 rosie
Last Name                              henderson
Gender                                    Female
Age                                           29
Email                 r.henderson@randatmail.com
Phone                                747-7768-48
Education                                Primary
Occupation                               Manager
Experience (Years)                            14
Salary                                     42540
Marital Status                            Single
Number of Children                             1
Name: 1, dtype: object

In [36]:
# Return the first 8 rows and first 2 columns (the 0 can be omitted)
df_people.iloc[0:8,0:2]

Unnamed: 0,First Name,Last Name
0,amy,moore
1,rosie,henderson
2,garry,cooper
3,sarah,miller
4,rubie,sullivan
5,fiona,williams
6,thomas,carter
7,sawyer,martin


In [37]:
# Return rows 1, 3, and 5 and show only the first name, last name, and salary
df_people.iloc[[1,3,5],[0,1,9]]

Unnamed: 0,First Name,Last Name,Salary
1,rosie,henderson,42540
3,sarah,miller,97946
5,fiona,williams,65368


## Using loc
The ```loc``` indexer in pandas selects data from a DataFrame by label. Like ```iloc```, it is used with square brackets rather than called like a function. It lets you specify row and column labels to access specific data points or subsets of the DataFrame, and it also accepts a boolean Series as the row selector.

The general syntax of loc is:

```Python
df.loc[row_label(s), column_label(s)]
```

In [38]:
married_filter = (df_people['Marital Status'] == "Married")
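
A boolean filter like `married_filter` is typically passed to `loc` to keep only the matching rows. The sketch below uses a tiny made-up DataFrame named `people` as a stand-in for `df_people` (the rows are hypothetical), so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for df_people with made-up rows
people = pd.DataFrame({
    "First Name": ["ann", "bob", "cy"],
    "Marital Status": ["Married", "Single", "Married"],
    "Salary": [50000, 60000, 70000],
})

# Boolean Series: True where the row matches the condition
married_filter = people["Marital Status"] == "Married"

# Rows where the filter is True, restricted to two columns by label
married = people.loc[married_filter, ["First Name", "Salary"]]

print(married)
```

Note that the original row labels are preserved in the result, so the filtered rows keep their index values from the source DataFrame.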

# Modifying data

## unique() and nunique()

In [44]:
df_people.Occupation = df_people.Occupation.apply(lambda x: x.replace("Lawer", "Lawyer"))

In [45]:
df_people.head(20)

Unnamed: 0,First Name,Last Name,Gender,Age,Email,Phone,Education,Occupation,Experience (Years),Salary,Marital Status,Number of Children
0,amy,moore,Female,26,a.moore@randatmail.com,177-8697-63,Bachelor,Astronomer,11,118590,Married,4
1,rosie,henderson,Female,29,r.henderson@randatmail.com,747-7768-48,Primary,Manager,14,42540,Single,1
2,garry,cooper,Male,29,g.cooper@randatmail.com,131-0615-33,Upper secondary,Agronamist,11,149123,Single,3
3,sarah,miller,Female,27,s.miller@randatmail.com,811-2617-15,Primary,Pharmacist,6,97946,Single,2
4,rubie,sullivan,Female,23,r.sullivan@randatmail.com,543-4162-06,Bachelor,Engineer,8,78613,Married,4
5,fiona,williams,Female,25,f.williams@randatmail.com,807-4311-40,Doctoral,Lecturer,7,65368,Single,5
6,thomas,carter,Male,24,t.carter@randatmail.com,281-1436-40,Bachelor,Veteranarian,5,64881,Single,4
7,sawyer,martin,Male,20,s.martin@randatmail.com,905-3877-91,Upper secondary,Lawyer,7,140405,Married,0
8,eleanor,robinson,Female,20,e.robinson@randatmail.com,049-5493-56,Primary,Scientist,6,194147,Married,3
9,adrianna,kelley,Female,22,a.kelley@randatmail.com,251-3368-86,Upper secondary,Actor,0,160569,Married,1


In [47]:
print(df_people['Occupation'].nunique())

55


# Add / Remove Rows and Columns

# Grouping and Aggregating

# Handling Missing Values
Missing values are a common problem when performing data analysis. This section discusses ways to handle missing values.

In [48]:
import random
import numpy as np

import pandas as pd
df_people = pd.read_csv('files/people_data.csv')

First, to make it easier, we'll create a small subset of the people DataFrame. Note the use of ```copy()```. By default, ```copy()``` creates a *deep copy* of the DataFrame, so the subset has its own data. This avoids making changes to a slice of the original DataFrame, which can yield unpredictable results and, in many pandas versions, a ```SettingWithCopyWarning```.

In [53]:
small_df = df_people.iloc[:10,:].copy()
print(small_df.shape)
small_df.head()

(10, 12)


Unnamed: 0,First Name,Last Name,Gender,Age,Email,Phone,Education,Occupation,Experience (Years),Salary,Marital Status,Number of Children
0,amy,,Female,26,a.moore@randatmail.com,177-8697-63,Bachelor,Astronomer,11,118590,Married,4
1,rosie,henderson,Female,29,r.henderson@randatmail.com,747-7768-48,Primary,Manager,14,42540,Single,1
2,garry,,Male,29,g.cooper@randatmail.com,131-0615-33,Upper secondary,Agronamist,11,149123,Single,3
3,sarah,miller,Female,27,s.miller@randatmail.com,811-2617-15,Primary,Pharmacist,6,97946,Single,2
4,rubie,sullivan,Female,23,r.sullivan@randatmail.com,543-4162-06,Bachelor,Engineer,8,78613,Married,4
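
To illustrate the effect of ```copy()```, here is a minimal sketch with a made-up one-column DataFrame (the names `original` and `subset` are assumptions). Changes made to the copy do not reach the DataFrame it was taken from.

```python
import pandas as pd

# Hypothetical data standing in for the people DataFrame
original = pd.DataFrame({"Age": [26, 29, 29]})

# Take the first two rows and make an independent copy
subset = original.iloc[:2, :].copy()

# Modify the copy only
subset.loc[0, "Age"] = 99

print(original.loc[0, "Age"])  # value in the original DataFrame
print(subset.loc[0, "Age"])    # value in the copy
```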


Next, we'll randomly add missing values using ```np.nan``` and create one row and one column of NaN values.

In [57]:
# Introduce missing (NaN) values into the small DataFrame

num_missing = 5

for _ in range(num_missing):
    random_row = random.randint(0,9)
    random_col = random.randint(0,9)
    small_df.iloc[random_row, random_col] = np.nan

# Add a column of NaN values
small_df['SS#'] = np.nan

# Add a row of NaN values
small_df.loc[len(small_df.index)] = np.nan

The syntax of ```dropna``` is as follows:

```Python
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
```

- **axis**: drop rows (```axis=0```) or columns (```axis=1```) which contain missing values
- **how**: drop (row|column) if it contains ```any``` missing value; only drop (row|column) if ```all``` values are missing
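
The interaction of ```axis``` and ```how``` can be sketched on a small made-up DataFrame (the column names `a`, `b`, `c` and the values are assumptions chosen so that one row and one column are entirely missing).

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 3 and column "c" contain only NaN
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

rows_all = df.dropna(axis=0, how="all")  # drops only the all-NaN row
cols_all = df.dropna(axis=1, how="all")  # drops only the all-NaN column
rows_any = df.dropna(axis=0, how="any")  # drops every row with at least one NaN

# Remove the empty column first, then any row that still has a gap
cleaned = df.dropna(axis=1, how="all").dropna(axis=0, how="any")

print(rows_all)
print(cols_all)
print(cleaned)
```

Because column `c` is entirely missing, ```how="any"``` on rows removes everything here, which is why the empty column is dropped first in the last step.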

### Sum the count of missing values by column

In [58]:
small_df.isna().sum()

First Name             5
Last Name              6
Gender                 5
Age                    7
Email                  5
Phone                  6
Education              9
Occupation             4
Experience (Years)     6
Salary                 5
Marital Status         4
Number of Children     4
SS#                   14
dtype: int64

In [59]:
small_df.isna().sum().sum()

80

# Casting Data Types
After creating a pandas DataFrame, you may have columns that were not assigned the data type you want. For example, after using the ```read_csv()``` function to create ```df_people```, the ```Salary``` column is stored as an integer instead of a float.

In [None]:
df_people.dtypes

In [None]:
df_people['Salary'] = df_people['Salary'].astype(float)
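
The same cast can be tried without the CSV file. This sketch uses a made-up DataFrame with three salary values and compares the column's dtype before and after ```astype(float)```.

```python
import pandas as pd

# Hypothetical stand-in for the Salary column of df_people
df = pd.DataFrame({"Salary": [42540, 97946, 65368]})

before = df["Salary"].dtype                 # integer dtype inferred from the data
df["Salary"] = df["Salary"].astype(float)   # cast the column to float
after = df["Salary"].dtype

print(before, "->", after)
```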

# Working with Time Series Data

# Interacting with Excel, JSON, Parquet files, SQL

## Using iloc

# Using Polars with Pandas


# Initialize everything

This section is provided to quickly reload the Series and DataFrame objects used in this notebook.

In [None]:
import pandas as pd
import numpy as np

grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
grade_series = pd.Series(grades)

grades = [88, 67, 100, 92, None, 95, 82, 100, 100, 95]
students = ['dmac', 'edev', 'joeb', 'tdog', 'txroy', 'sthicks', 'jfrerk', 'spickard', 'choenes', 'jsisson']
student_grades_series = pd.Series(grades, students)

df_people = pd.read_csv('files/people_data.csv')
small_df = df_people.iloc[:10,:].copy()

In [None]:
df_nfl = pd.read_csv('files/NFLPlaybyPlay2015.csv', low_memory=False)

In [None]:
df_nfl.info()

In [None]:
df_nfl['InterceptionThrown'].unique()

In [None]:
df_nfl.isna().sum().sum()

In [None]:
df_nfl['Date2'] = pd.to_datetime(df_nfl['Date'])

In [None]:
df_nfl['Date2'].nunique()
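
The date conversion above can be sketched without the NFL file. The DataFrame `games` and its three date strings are made up for illustration, and the format of the real `Date` column is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the Date column of df_nfl
games = pd.DataFrame({"Date": ["2015-09-10", "2015-09-13", "2015-09-13"]})

# Parse the strings into datetime64 values
games["Date2"] = pd.to_datetime(games["Date"])

unique_dates = games["Date2"].nunique()             # number of distinct dates
day_names = games["Date2"].dt.day_name().tolist()   # weekday name for each date

print(games["Date2"].dtype)
print(unique_dates)
print(day_names)
```

Once a column holds datetime values, the `.dt` accessor exposes date parts such as the weekday name, which is not possible while the column is still stored as strings.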