# Pandas
- Pandas is an open-source data manipulation library for Python. 
- It provides easy-to-use data structures and data analysis tools for working with structured data, such as spreadsheets and databases.
  - **Series**: A Series is a one-dimensional array-like object in Pandas. It's often used for representing a single column or row of data within a DataFrame. Each element in a Series is labeled with an index, which can be a label or integer.
  - **DataFrame**: The core data structure in Pandas is the DataFrame. It is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Think of it as a powerful spreadsheet that can store and manipulate data.
  - **Data Visualization**: While Pandas is primarily a data manipulation library, it integrates well with visualization libraries such as Matplotlib and Seaborn, which lets us plot and chart our data.
- To install Pandas using conda:
    - conda install -c anaconda pandas

In [1]:
# To use Pandas, we need to import it into our notebook.
import pandas as pd

import numpy as np

## Pandas Series
- A Pandas Series is a one-dimensional labeled array that can hold data of any type. 
- It's similar to a column in a spreadsheet or a single-dimensional array.

In [2]:
data = [10, 20, 30, 40, 50]
series = pd.Series(data)

# Display the Series
print("Pandas Series:")
print(series)

Pandas Series:
0    10
1    20
2    30
3    40
4    50
dtype: int64


In [3]:
# We can specify a custom index for the Series.
custom_index = ['A', 'B', 'C', 'D', 'E']
series_custom_index = pd.Series(data, index=custom_index)

# The .values attribute allows us to access the data (values) in the Pandas Series.
values = series_custom_index.values

# The .index attribute provides access to the index labels of the Pandas Series.
index = series_custom_index.index

# Access the second element.
# With the default integer index, the label 1 and the position 1 refer to the same element.
element = series[1]

# Access elements using custom index
element_A = series_custom_index['A']
element_D = series_custom_index.D

# Display the Series with a custom index
print("Pandas Series with Custom Index:")
print(series_custom_index)

print()

# Display the values of the Series
print("Values of the Series:")
print(values)

print()

# Display the index labels of the Series
print("Index Labels of the Series:")
print(index)

print()

# Display the accessed elements
print("Accessed Element in Series:", element)
print("Accessed Element 'A' in Custom Index Series:", element_A)
print("Accessed Element 'D' in Custom Index Series:", element_D)

Pandas Series with Custom Index:
A    10
B    20
C    30
D    40
E    50
dtype: int64

Values of the Series:
[10 20 30 40 50]

Index Labels of the Series:
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

Accessed Element in Series: 20
Accessed Element 'A' in Custom Index Series: 10
Accessed Element 'D' in Custom Index Series: 40
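
The lookups above mix label-based and position-based access. A minimal sketch of the explicit forms, `.loc` for labels and `.iloc` for positions, is shown below. The Series `s` is a hypothetical copy of the one used above.

```python
import pandas as pd

# Hypothetical example data mirroring the custom-index Series above.
s = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])

by_label = s.loc['B']          # look up by index label
by_position = s.iloc[1]        # look up by 0-based position

label_slice = s.loc['B':'D']   # label slices include both endpoints
position_slice = s.iloc[1:3]   # position slices exclude the stop position

print(by_label, by_position)
print(label_slice.tolist())
print(position_slice.tolist())
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the index itself contains integers.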


In [4]:
# We can perform various operations on Series.

# Element-wise Operations
# Perform arithmetic operations on Series.
result = series * 2

# Filtering
# Filter elements based on conditions.
filtered_series = series_custom_index[series_custom_index > 25]

# Display the results
print("Series * 2:")
print(result)
print()
print("Filtered Series (Elements > 25):")
print(filtered_series)

Series * 2:
0     20
1     40
2     60
3     80
4    100
dtype: int64

Filtered Series (Elements > 25):
C    30
D    40
E    50
dtype: int64
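
Filters can also combine several conditions. The sketch below assumes the same values as the Series above and uses the element-wise operators `&` (and), `|` (or), and `~` (not). The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical example data mirroring the custom-index Series above.
s = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])

# Each comparison needs its own parentheses because & and | bind
# more tightly than the comparison operators.
between = s[(s > 15) & (s < 45)]     # values strictly between 15 and 45
extremes = s[(s < 15) | (s > 45)]    # values below 15 or above 45
not_thirty = s[~(s == 30)]           # everything except 30

print(between)
print(extremes)
print(not_thirty)
```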


## Pandas DataFrame
- A Pandas DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). 
- It's similar to a spreadsheet or a SQL table.

### Creating a Pandas DataFrame
- We will explore different methods to create Pandas DataFrames.

In [5]:
# Creating a DataFrame from Lists
# We can create a DataFrame from a list of lists.
data = [['Alice', 25], ['Bob', 30], ['Charlie', 35]]
df_from_lists = pd.DataFrame(data, columns=['Name', 'Age'])

# Display the DataFrame
print("DataFrame created from lists:")
print(df_from_lists)

DataFrame created from lists:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [6]:
# Creating a DataFrame from a Dictionary
# We can create a DataFrame from a dictionary.

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df_from_dict = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame created from a dictionary:")
print(df_from_dict)

DataFrame created from a dictionary:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [7]:
# Creating a Pandas DataFrame from a NumPy Array
# Let's create a NumPy array to be used for creating the DataFrame.
array = np.array([[1, 5],
                   [2, 6],
                   [3, 7],
                   [4, 8]])

# Display the NumPy array
print("NumPy Array:")
print(array)

# We'll create a Pandas DataFrame using the NumPy array and specify custom row and column labels.
df = pd.DataFrame(array, index=['row1', 'row2', 'row3', 'row4'], columns=['col1', 'col2'])

# Display the Pandas DataFrame
print("\nPandas DataFrame:")
print(df)

NumPy Array:
[[1 5]
 [2 6]
 [3 7]
 [4 8]]

Pandas DataFrame:
      col1  col2
row1     1     5
row2     2     6
row3     3     7
row4     4     8


In [8]:
# Reading Data from a CSV File
# Let's read the data from cereal.csv. The relative path means the file must be in the current working directory.
df_read = pd.read_csv('cereal.csv')

# Display the DataFrame read from the CSV file
print("\nDataFrame Read from CSV File:")
print(df_read)


DataFrame Read from CSV File:
                         name  calories  fiber  sugars  vitamins     rating
0                   100% Bran        70   10.0       6        25  68.402973
1           100% Natural Bran       120    2.0       8         0  33.983679
2                    All-Bran        70    9.0       5        25  59.425505
3   All-Bran with Extra Fiber        50   14.0       0        25  93.704912
4              Almond Delight       110    1.0       8        25  34.384843
..                        ...       ...    ...     ...       ...        ...
72                    Triples       110    0.0       3        25  39.106174
73                       Trix       110    0.0      12        25  27.753301
74                 Wheat Chex       100    3.0       3        25  49.787445
75                   Wheaties       100    3.0       3        25  51.592193
76        Wheaties Honey Gold       110    1.0       8        25  36.187559

[77 rows x 6 columns]
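
`pd.read_csv` accepts options that control which columns and rows are loaded. The sketch below reads from an in-memory string so that it runs without any file on disk. The CSV content and the name `df_small` are hypothetical and are not taken from cereal.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """name,calories,fiber
Cereal A,70,10.0
Cereal B,120,2.0
Cereal C,50,14.0
"""

df_small = pd.read_csv(
    io.StringIO(csv_text),          # any file path or file-like object works here
    usecols=['name', 'calories'],   # load only these columns
    index_col='name',               # use the 'name' column as row labels
    nrows=2,                        # read only the first 2 data rows
)

print(df_small)
```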


### Basic DataFrame Operations
- We will explore some fundamental operations that can be performed on Pandas DataFrames.

In [9]:
data = {'Name': ['Alice', 'Bob'],
        'Age': [25, 30],
        'City': ['New York', 'Chicago']}

df = pd.DataFrame(data, index=['row1', 'row2'])

# Display the sample DataFrame
print("Sample DataFrame:")
print(df)

# The .index attribute provides access to the index labels of the DataFrame.
index_labels = df.index

# The .columns attribute allows you to access the column labels of the DataFrame.
column_labels = df.columns

# The .values attribute provides access to the data in the DataFrame as a NumPy array.
data_values = df.values

# Display the index labels
print("\nDataFrame Index Labels:")
print(index_labels)

# Display the column labels
print("\nDataFrame Column Labels:")
print(column_labels)

# Display the data as a NumPy array
print("\nDataFrame Data as NumPy Array:")
print(data_values)

Sample DataFrame:
       Name  Age      City
row1  Alice   25  New York
row2    Bob   30   Chicago

DataFrame Index Labels:
Index(['row1', 'row2'], dtype='object')

DataFrame Column Labels:
Index(['Name', 'Age', 'City'], dtype='object')

DataFrame Data as NumPy Array:
[['Alice' 25 'New York']
 ['Bob' 30 'Chicago']]


In [10]:
# Let's create a sample DataFrame to perform operations on.
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 28],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)

# Display the DataFrame
print("Sample DataFrame:")
print(df)

print()

# Viewing Data
# We can view and explore the data in a DataFrame.
# Head and Tail: Display the first and last few rows of the DataFrame.
head = df.head(2)  # Display the first 2 rows
tail = df.tail(2)  # Display the last 2 rows

# Shape
# Get the dimensions (rows and columns) of the DataFrame.
shape = df.shape

# Display the results
print("\nFirst 2 Rows:")
print(head)
print("\nLast 2 Rows:")
print(tail)
print("\nDataFrame Shape:")
print(shape)

Sample DataFrame:
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   28        Chicago


First 2 Rows:
    Name  Age           City
0  Alice   25       New York
1    Bob   30  San Francisco

Last 2 Rows:
      Name  Age         City
2  Charlie   35  Los Angeles
3    David   28      Chicago

DataFrame Shape:
(4, 3)


In [11]:
# Display the DataFrame
print("Sample DataFrame:")
print(df)

# Selecting Columns
# Select a specific column.
name_column = df['Name']

# Slicing Rows
# Slice rows by position.
sliced_rows = df[1:3]  # Rows at positions 1 and 2 (the stop position, 3, is excluded)

# Display the selected data
print("\nSelected 'Name' Column:")
print(name_column)
print("\nSliced Rows (Index 1-2):")
print(sliced_rows)

# The .loc attribute allows you to select data from the DataFrame using label-based indexing.
# Select a single cell using .loc.
cell_value = df.loc[1, 'Name']

# Slice rows and columns using .loc. Label slices include both endpoints, so 1:2 selects rows 1 and 2.
sliced_data = df.loc[1:2, ['Name', 'Age']]

# Display the results
print("\nSelected Cell Value (Row 1, Column 'Name'):")
print(cell_value)
print("\nSliced Data (Rows 1-2, Columns 'Name' and 'Age'):")
print(sliced_data)

# The .iloc attribute allows you to select data from the DataFrame using integer-based indexing.
# Select a single cell using .iloc.
cell_value = df.iloc[1, 0]

# Slice rows and columns using .iloc. Position slices exclude the stop, so 1:3 selects positions 1 and 2.
sliced_data_1 = df.iloc[1:3, 0:2]
sliced_data_2 = df.iloc[:, 0:2]

# Display the results
print("\nSelected Cell Value (Row 1, Column 0):")
print(cell_value)
print("\nSliced Data (Rows 1-2, Columns 0-1):")
print(sliced_data_1)
print("\nSliced Data (All Rows, Columns 0-1):")
print(sliced_data_2)

Sample DataFrame:
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   28        Chicago

Selected 'Name' Column:
0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object

Sliced Rows (Index 1-2):
      Name  Age           City
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles

Selected Cell Value (Row 1, Column 'Name'):
Bob

Sliced Data (Rows 1-2, Columns 'Name' and 'Age'):
      Name  Age
1      Bob   30
2  Charlie   35

Selected Cell Value (Row 1, Column 0):
Bob

Sliced Data (Rows 1-2, Columns 0-1):
      Name  Age
1      Bob   30
2  Charlie   35

Sliced Data (All Rows, Columns 0-1):
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   28
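
With the default index, labels and positions coincide, which hides the difference between `.loc` and `.iloc`. The sketch below uses a hypothetical DataFrame, `df_demo`, with non-default integer labels to make the difference visible. It also shows label-based selection with a boolean condition.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
     'Age': [25, 30, 35, 28]},
    index=[10, 20, 30, 40],  # hypothetical non-default row labels
)

label_rows = df_demo.loc[20:30]       # labels 20 and 30, both included
position_rows = df_demo.iloc[1:3]     # positions 1 and 2; position 3 is excluded

# A boolean condition selects rows; the list selects columns.
younger = df_demo.loc[df_demo['Age'] < 30, ['Name', 'Age']]

print(label_rows)
print(position_rows)
print(younger)
```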


### Data Cleaning and Preprocessing
- We will explore data cleaning and preprocessing techniques using Pandas.

In [12]:
# Let's create a sample DataFrame.

data = {'Name': ['Alice', 'Bob', 'Kim'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Chicago', 'Miami']}

df = pd.DataFrame(data)

# Display the sample DataFrame
print("Sample DataFrame:")
print(df)

# We can use .drop() to remove specific rows from the DataFrame.

# Remove rows with indices 0 and 2.
df_removed_rows = df.drop([0, 2])

# Display the DataFrame after removing rows
print("\nDataFrame After Removing Rows (Indices 0 and 2):")
print(df_removed_rows)

# We can also use .drop() to remove specific columns from the DataFrame.
# Remove the 'City' column.
df_removed_column = df.drop('City', axis=1)

# Display the DataFrame after removing a column
print("\nDataFrame After Removing 'City' Column:")
print(df_removed_column)

Sample DataFrame:
    Name  Age      City
0  Alice   25  New York
1    Bob   30   Chicago
2    Kim   35     Miami

DataFrame After Removing Rows (Indices 0 and 2):
  Name  Age     City
1  Bob   30  Chicago

DataFrame After Removing 'City' Column:
    Name  Age
0  Alice   25
1    Bob   30
2    Kim   35


In [13]:
# Let's create a sample DataFrame with some data that requires cleaning and preprocessing.
data = {'Name': [None, 'Bob', 'Charlie', 'Kim'],
        'Age': [25, 30, np.nan, 35],  # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
        'City': ['New York', '', 'Chicago', 'Denver']}

df = pd.DataFrame(data)

# Display the sample DataFrame
print("Sample DataFrame:")
print(df)

# Handling missing values: Missing values are common in datasets and need to be handled.
# Detect missing values (None and NaN) in the DataFrame.
# Note that the empty string in 'City' is not treated as missing.
missing_values = df.isnull()

# Remove rows with missing values.
df_no_missing = df.dropna()

# Display the results
print("\nMissing Values Detected:")
print(missing_values)
print("\nDataFrame After Dropping Rows with Missing Values:")
print(df_no_missing)

Sample DataFrame:
      Name   Age      City
0     None  25.0  New York
1      Bob  30.0          
2  Charlie   NaN   Chicago
3      Kim  35.0    Denver

Missing Values Detected:
    Name    Age   City
0   True  False  False
1  False  False  False
2  False   True  False
3  False  False  False

DataFrame After Dropping Rows with Missing Values:
  Name   Age    City
1  Bob  30.0        
3  Kim  35.0  Denver
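
In the output above, the row with an empty `City` survives `dropna()` because an empty string is not a missing value. One possible approach, sketched below, converts empty strings to NaN first and then fills the gaps instead of dropping rows. The fill values (`'Unknown'` and the column mean) are illustrative assumptions, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Name': [None, 'Bob', 'Charlie', 'Kim'],
                        'Age': [25, 30, np.nan, 35],
                        'City': ['New York', '', 'Chicago', 'Denver']})

# Treat empty strings as missing so that isnull() and dropna() can see them.
df_marked = df_demo.replace('', np.nan)

# Count the missing values in each column.
missing_per_column = df_marked.isnull().sum()

# Fill instead of drop: a placeholder for text, the column mean for numbers.
df_filled = df_marked.fillna({'Name': 'Unknown',
                              'City': 'Unknown',
                              'Age': df_marked['Age'].mean()})

print(missing_per_column)
print(df_filled)
```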


In [14]:
# Let's create a sample DataFrame with some data that requires cleaning and preprocessing.
data = {'Name': [None, 'Bob', 'Charlie', 'Kim'],
        'Age': [25, 30, 15, 35],
        'City': ['New York', '', 'Chicago', 'Denver']}

df = pd.DataFrame(data)

# Display the sample DataFrame
print("Sample DataFrame:")
print(df)

# We can rename columns to make them more descriptive.
df_renamed = df.rename(columns={'Name': 'Full Name', 'Age': 'Years'})

# Display the DataFrame with renamed columns
print("\nDataFrame with Renamed Columns:")
print(df_renamed)

# Values can be replaced.
# Replace empty strings in the 'City' column with 'Unknown'.
# .replace() returns a new Series, so we assign the result back to the column.
# Calling it with inplace=True on df['City'] is chained assignment, which recent pandas versions warn about.
df['City'] = df['City'].replace('', 'Unknown')

# Display the DataFrame after replacing empty values
print("\nDataFrame After Replacing Empty Values:")
print(df)

# Converting Data Types
# Convert the 'Age' column to float.
df['Age'] = df['Age'].astype(float)

# Display the DataFrame after data type conversion
print("\nDataFrame After Data Type Conversion (Age to float):")
print(df)

Sample DataFrame:
      Name  Age      City
0     None   25  New York
1      Bob   30          
2  Charlie   15   Chicago
3      Kim   35    Denver

DataFrame with Renamed Columns:
  Full Name  Years      City
0      None     25  New York
1       Bob     30          
2   Charlie     15   Chicago
3       Kim     35    Denver

DataFrame After Replacing Empty Values:
      Name  Age      City
0     None   25  New York
1      Bob   30   Unknown
2  Charlie   15   Chicago
3      Kim   35    Denver

DataFrame After Data Type Conversion (Age to float):
      Name   Age      City
0     None  25.0  New York
1      Bob  30.0   Unknown
2  Charlie  15.0   Chicago
3      Kim  35.0    Denver
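
`.astype()` raises an error if a value cannot be converted. When a column may contain non-numeric text, `pd.to_numeric` with `errors='coerce'` is one alternative: it turns unparseable entries into NaN. The sketch below uses a hypothetical Series, `ages`, that is not part of the data above.

```python
import pandas as pd

# Hypothetical column of ages stored as text, with one unparseable entry.
ages = pd.Series(['25', '30', 'unknown', '35'])

# Entries that cannot be parsed become NaN, so the result is a float column.
ages_numeric = pd.to_numeric(ages, errors='coerce')

# After dropping the NaN, the remaining values convert cleanly to integers.
ages_int = ages_numeric.dropna().astype(int)

print(ages_numeric)
print(ages_int)
```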


In [15]:
# Let's create a sample DataFrame with some data that requires cleaning and preprocessing.
data = {'Name': ['Charlie', 'Kim', 'Kim'],
        'Age': [np.nan, 35, 35],  # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
        'City': ['Chicago', 'Denver', 'Denver']}

df = pd.DataFrame(data)

# Display the sample DataFrame
print("Sample DataFrame:")
print(df)

# Handling duplicates: Duplicate rows in a dataset can distort the analysis.
# Detect duplicate rows in the DataFrame.
duplicates = df.duplicated()

# Remove duplicate rows.
df_no_duplicates = df.drop_duplicates()

# Display the results
print("\nDuplicate Rows Detected:")
print(duplicates)
print("\nDataFrame After Removing Duplicates:")
print(df_no_duplicates)

Sample DataFrame:
      Name   Age     City
0  Charlie   NaN  Chicago
1      Kim  35.0   Denver
2      Kim  35.0   Denver

Duplicate Rows Detected:
0    False
1    False
2     True
dtype: bool

DataFrame After Removing Duplicates:
      Name   Age     City
0  Charlie   NaN  Chicago
1      Kim  35.0   Denver
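
By default, a row counts as a duplicate only if every column matches. The sketch below, on a hypothetical DataFrame `df_demo`, uses the `subset` and `keep` parameters of `drop_duplicates` to compare on one column and to choose which copy survives.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Kim', 'Kim', 'Lee'],
                        'City': ['Denver', 'Boston', 'Denver']})

# Rows are duplicates when 'Name' matches, regardless of 'City'.
keep_first = df_demo.drop_duplicates(subset=['Name'])               # keep the first copy (default)
keep_last = df_demo.drop_duplicates(subset=['Name'], keep='last')   # keep the last copy
drop_all = df_demo.drop_duplicates(subset=['Name'], keep=False)     # drop every duplicated row

print(keep_first)
print(keep_last)
print(drop_all)
```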


In [16]:
# Let's create a sample DataFrame.
data = {'Name': ['Alice', 'David', 'Charlie'],
        'Age': [25, 30, 28]}

df = pd.DataFrame(data)

# Display the sample DataFrame
print("Sample DataFrame:")
print(df)

# The .sort_index() method allows us to sort the DataFrame or Series by its index.
# Sort the DataFrame by index in ascending order.
df_sorted_index_asc_row = df.sort_index(axis=0, ascending=True) # Sort by row labels
df_sorted_index_asc_col = df.sort_index(axis=1, ascending=True) # Sort by column labels

# Sort the DataFrame by index in descending order.
df_sorted_index_desc = df.sort_index(ascending=False) 

# Display the results
print("\nDataFrame Sorted by Index (Row) (Ascending):")
print(df_sorted_index_asc_row)
print("\nDataFrame Sorted by Index (Column) (Ascending):")
print(df_sorted_index_asc_col)
print("\nDataFrame Sorted by Index (Descending):")
print(df_sorted_index_desc)

# The .sort_values() method allows us to sort the DataFrame or Series by its values.
# Sort the DataFrame by the 'Age' column in ascending order.
df_sorted_values_asc = df.sort_values(by='Age', ascending=True)

# Sort the DataFrame by the 'Age' column in descending order.
df_sorted_values_desc = df.sort_values(by='Age', ascending=False)

# Display the results
print("\nDataFrame Sorted by Values (Column Age) (Ascending):")
print(df_sorted_values_asc)
print("\nDataFrame Sorted by Values (Column Age) (Descending):")
print(df_sorted_values_desc)

Sample DataFrame:
      Name  Age
0    Alice   25
1    David   30
2  Charlie   28

DataFrame Sorted by Index (Row) (Ascending):
      Name  Age
0    Alice   25
1    David   30
2  Charlie   28

DataFrame Sorted by Index (Column) (Ascending):
   Age     Name
0   25    Alice
1   30    David
2   28  Charlie

DataFrame Sorted by Index (Descending):
      Name  Age
2  Charlie   28
1    David   30
0    Alice   25

DataFrame Sorted by Values (Column Age) (Ascending):
      Name  Age
0    Alice   25
2  Charlie   28
1    David   30

DataFrame Sorted by Values (Column Age) (Descending):
      Name  Age
1    David   30
2  Charlie   28
0    Alice   25
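
`sort_values` also accepts a list of columns, which helps when the first column contains ties. The sketch below uses a hypothetical DataFrame, `df_demo`, with repeated ages. It sorts by `Age` descending, breaks ties by `Name`, and then renumbers the rows.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Alice', 'David', 'Charlie', 'Bob'],
                        'Age': [25, 30, 25, 30]})

# Sort by 'Age' descending, then break ties by 'Name' ascending.
df_sorted = df_demo.sort_values(by=['Age', 'Name'], ascending=[False, True])

# Sorting keeps the original row labels; reset_index(drop=True) renumbers them from 0.
df_renumbered = df_sorted.reset_index(drop=True)

print(df_sorted)
print(df_renumbered)
```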


## Basic Statistics
- We will explore basic statistics and data analysis with Pandas.

In [17]:
# Let's create a sample DataFrame with some data for analysis.
data = {'Age': [25, 30, 35, 28, 32],
        'Salary': [500, 600, 750, 520, 800]}

df = pd.DataFrame(data)

# Display the sample DataFrame
print("Sample DataFrame:")
print(df)

# We can use .describe() to get summary statistics of the DataFrame.
descriptive_stats = df.describe()

# Display the summary statistics
print("\nDescriptive Statistics:")
print(descriptive_stats)

# We can calculate the mean and median of a column using .mean() and .median().
mean_age = df['Age'].mean()
median_salary = df['Salary'].median()

# Display the mean and median
print("\nMean Age:", mean_age)
print("Median Salary:", median_salary)

# We can calculate the variance and standard deviation of a column using .var() and .std().
variance_age = df['Age'].var()
std_dev_salary = df['Salary'].std()

# Display the variance and standard deviation
print("\nVariance of Age:", variance_age)
print("Standard Deviation of Salary:", std_dev_salary)

# We can calculate the correlation between columns using .corr().
correlation = df.corr()

# Display the correlation matrix
print("\nCorrelation Matrix:")
print(correlation)

# We can count the number of occurrences of each unique value in a column using .value_counts().
count_ages = df['Age'].value_counts()

# Display the count of unique values in the 'Age' column
print("\nCount of Ages:")
print(count_ages)

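# We can group the data by a column and perform aggregation using .groupby() and aggregation functions.
# Each age appears only once here, so every group contains a single row and its mean equals its max.
grouped_data = df.groupby('Age').agg({'Salary': ['mean', 'max']})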

# Display the grouped and aggregated data
print("\nGrouped and Aggregated Data:")
print(grouped_data)

Sample DataFrame:
   Age  Salary
0   25     500
1   30     600
2   35     750
3   28     520
4   32     800

Descriptive Statistics:
             Age     Salary
count   5.000000    5.00000
mean   30.000000  634.00000
std     3.807887  135.20355
min    25.000000  500.00000
25%    28.000000  520.00000
50%    30.000000  600.00000
75%    32.000000  750.00000
max    35.000000  800.00000

Mean Age: 30.0
Median Salary: 600.0

Variance of Age: 14.5
Standard Deviation of Salary: 135.20355024924456

Correlation Matrix:
             Age    Salary
Age     1.000000  0.878914
Salary  0.878914  1.000000

Count of Ages:
Age
25    1
30    1
35    1
28    1
32    1
Name: count, dtype: int64

Grouped and Aggregated Data:
    Salary     
      mean  max
Age            
25   500.0  500
28   520.0  520
30   600.0  600
32   800.0  800
35   750.0  750
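
Grouping is more informative when the key column repeats. The sketch below adds a hypothetical `Department` column, which is not in the data above, so that each group holds several rows.

```python
import pandas as pd

# Hypothetical data: the 'Department' column is invented for illustration.
df_demo = pd.DataFrame({'Department': ['Sales', 'Sales', 'IT', 'IT', 'IT'],
                        'Salary': [500, 600, 750, 520, 800]})

# Compute several aggregates of 'Salary' for each department.
grouped = df_demo.groupby('Department')['Salary'].agg(['mean', 'max', 'count'])

print(grouped)
```

Selecting the `Salary` column before calling `.agg()` yields flat column names (`mean`, `max`, `count`) rather than the two-level header shown in the output above.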
