# Installing packages in Python  


### In a Jupyter notebook  

* latest version: %pip install pandas
* specific version: %pip install pandas==2.1.1

NOTE: Some versions of Jupyter use !pip instead of %pip

In [1]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.1.2 -> 25.0.1
[notice] To update, run: C:\Users\LucyKnight\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


### After installing the package, import it into the environment.  

import pandas as pd  
* pd is a common ALIAS for pandas.  
* makes it quicker to type references to the package when you need it.
* makes your code more concise and more readable. 

In [3]:
import pandas as pd
data = [1, 2, 3, 4, 5]
series = pd.Series(data, name="MySeries")

# Vectorized operations of series
series = series + 15

In [4]:
series

0    16
1    17
2    18
3    19
4    20
Name: MySeries, dtype: int64

### TASK 1: Installing and importing modules  
* Download and open the 'Introduction to Pandas.ipynb' notebook.  
* Complete the exercises in the __Task 1__ section.  
* OPTIONAL: read up on how to install packages in a Python environment outside of Jupyter notebooks.

# Data structures

## Series  
* Similar to a Python dictionary.  
* A mapping of index values to data values.
* A one-dimensional labelled array
* Can hold various data types.
* Similar to a column in an Excel spreadsheet or a single column in a SQL table.
* Labelling
    * Each element in a Series has a label or an index, allowing for  easy data access and manipulation.
    * Default numeric values assigned, if not defined.
* Homogeneous data - a series typically stores data of the same data type.
* Vectorised Operations - allows efficient element-wise calculations. Operations can be performed on entire columns or Series without explicit loops.


In [5]:
# Create a pandas series from a dictionary 
d = {'a':42., 'b':6., 'c':2.5677}
s = pd.Series(d)
print(s)

a    42.0000
b     6.0000
c     2.5677
dtype: float64
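
The points above can be sketched in a few lines. The variable names and values below are illustrative only and are not part of the course notebook. The sketch shows label-based access, the default integer index, and a vectorised operation on a Series.

```python
import pandas as pd

# Series built from a dictionary: the keys become the index labels
prices = pd.Series({'a': 42.0, 'b': 6.0, 'c': 2.5677})

# Access a value by its label, much like a dictionary lookup
value_b = prices['b']

# Without explicit labels, pandas assigns a default integer index 0, 1, 2, ...
counts = pd.Series([10, 20, 30])
default_labels = list(counts.index)

# Vectorised operation: every element is doubled without an explicit loop
doubled = counts * 2

print(value_b)
print(default_labels)
print(doubled.tolist())
```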


## DataFrame  
* 2-dimensional tabular structure with labelled axes.  
* Primary data structure for analysis in Pandas.  
* Similar to spreadsheets and SQL tables.  

A dataframe has:  
* __COLUMNS:__ Each column in a dataframe is a series.
* __INDEXING:__ Dataframes have both row and column indices.  
* __DATA ALIGNMENT:__ Dataframes can align data based on labels.  
* __DATA INTEGRATION:__ You can merge, join, and concatenate dataframes, and combine and analyse data from various sources.

In [8]:
data = {
 "Name": ["Alice", "Bob", "Charlie", "Derek"],
 "Age": [25, 30, 35, 40],
 "City": ["New York", "San Francisco", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
df

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    Derek   40        Chicago
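
As a rough illustration of the COLUMNS, INDEXING and DATA ALIGNMENT points above, the sketch below uses small made-up objects (`people`, `s1` and `s2` are hypothetical names). It checks that a single column is a Series, lists the row and column indices, and shows how two Series with different labels are aligned before being added.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
})

# Each column of a DataFrame is a Series
age_column = people["Age"]
print(type(age_column).__name__)

# A DataFrame has a column index and a row index
print(list(people.columns))
print(list(people.index))

# Data alignment: values are matched by label, not by position.
# Labels present in only one Series produce NaN in the result.
s1 = pd.Series([1, 2], index=["x", "y"])
s2 = pd.Series([10, 20], index=["y", "z"])
total = s1 + s2
print(total)
```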


# Acquiring data

In [None]:
# Installs only need to be run once per environment.
# Uncomment the next line to install the MySQL connector:
# %pip install mysql-connector-python
import pandas as pd
import mysql.connector

# Load data from csv file with the name - data.csv
df_csv = pd.read_csv('data.csv')
# Load data from excel file with the name - data.xlsx
df_excel = pd.read_excel('data.xlsx')

# You can specify a specific sheet using the sheet_name parameter
df_sheet1 = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Create a connection to a MySQL database.
# Replace 'username', 'password', 'localhost', and 'mydatabase'
# with your actual MySQL credentials and database details.
cnx = mysql.connector.connect(user='username', password='password',
                              host='localhost',
                              database='mydatabase') 

# Load data from a SQL database table
query = 'SELECT * FROM mytable'
df_sql = pd.read_sql_query(query, cnx)

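# Load data from HTML tables on a webpage
# read_html() returns a list of DataFrames, one per table found on the page
url = 'https://example.com/data-table.html'
html_tables = pd.read_html(url)
df_html_table = html_tables[0]  # take the first table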

# Load data from a JSON file
df_json = pd.read_json('data.json')

# Load data from Parquet file
df_parquet = pd.read_parquet('data.parquet')


In [None]:
df = pd.read_csv("employees.csv")


   Name   Age          City    Salary  Gender  column1    State  Start Date                 DateOfBirth
0  kyle  59.0  indianapolis  151000.0   Other  Raymond  Indiana  1990-12-09  1964-10-15 14:21:44.568274
1  Luis  31.0   Los Angeles   58000.0     NaN  Andrade       CA  1992-02-24  1992-10-08 14:21:44.568274


In [13]:
df.columns

Index(['Name', 'Age', 'City', 'Salary', 'Gender', 'column1', 'State',
       'Start Date', 'DateOfBirth'],
      dtype='object')

In [14]:
df.rename(columns={'column1':'Surname'}, inplace=True)
df.sample()

     Name   Age       City    Salary  Gender  Surname         State  Start Date                 DateOfBirth
15  Andre  54.0  allentown  151000.0     NaN   Martin  Pennsylvania  1991-05-05  1969-10-14 14:21:44.568274


In [2]:
data = {
 "Name": ["Alice", "Bob", "Charlie", "David"],
 "Age": [25, 30, 35, 40],
 "City": ["New York", "San Francisco", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)

# Renaming the 'Name' column to 'Person_Name'
# inplace=True modifies df directly instead of returning a new DataFrame.
df.rename(columns={'Name':'Person_Name'}, inplace=True)

# Exploring your data 

### Opening a CSV file  
* read_csv() loads the csv file contents into a dataframe.  
* the read_csv() function has many parameters.  
  * __delimiter:__ the character used to separate values in the CSV file. Default is ','.
  * __header:__ the row number to use as the column names. Default is 0 (the first row).
  * __index_col:__ the column to set as the index of the DataFrame.
  * __na_values:__ a list of additional values to treat as missing (NaN, 'Not a Number').

In [None]:
# Load data from a semicolon-separated file named separator.csv
# For tab-separated values (TSV) files, use delimiter="\t"
df_csv = pd.read_csv('separator.csv', delimiter=";")
df_csv.sample()

     Name   Age        City   Salary  Gender   column1    State  Start Date  DateOfBirth
86  bruce  57.0  Evansville  77000.0   Other  Harrison  Indiana  1999-05-26   1966-10-15
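
The read_csv() parameters listed above can be tried without any file on disk. The sketch below is a minimal example under stated assumptions: the inline text, the column names and the 'unknown' marker are all made up for illustration, and `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

# Inline text standing in for a semicolon-separated file (hypothetical data)
raw = io.StringIO(
    "id;name;age\n"
    "1;Alice;25\n"
    "2;Bob;unknown\n"
)

df_params = pd.read_csv(
    raw,
    delimiter=";",          # values are separated by semicolons
    header=0,               # the first row holds the column names
    index_col="id",         # use the 'id' column as the index
    na_values=["unknown"],  # treat the text 'unknown' as missing (NaN)
)
print(df_params)
```

Because 'unknown' is treated as missing, the 'age' column is read as numeric rather than as text.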


#### Displaying dataframes  
* __.head(n)__ First n rows of the DataFrame.  
* __.columns__ Column names of the DataFrame.  
* __.tail(n)__ Last n rows of the DataFrame.  
* __.sample(n)__ Random n rows from the DataFrame.  



In [19]:
df.head(10)

          Name   Age          City    Salary  Gender    Surname         State  Start Date                 DateOfBirth
0         kyle  59.0  indianapolis  151000.0   Other    Raymond       Indiana  1990-12-09  1964-10-15 14:21:44.568274
1         Luis  31.0   Los Angeles   58000.0     NaN    Andrade            CA  1992-02-24  1992-10-08 14:21:44.568274
2    katherine  46.0    Naperville  146000.0  female  Gutierrez            IL  21-12-2015  1977-10-12 14:21:44.568274
3       robert  25.0    pittsburgh   66000.0    Male      Yates  Pennsylvania  1993-01-25  1998-10-07 14:21:44.568274
4       austin  49.0    Naperville   96000.0     NaN     Turner            IL  20-09-1979  1974-10-13 14:21:44.568274
5  christopher  48.0       Buffalo   47000.0   other    Carroll            NY  1990-12-09  1975-10-13 14:21:44.568274
6     michelle  45.0    Fort Wayne   61000.0    Male   Anderson       Indiana  1991-05-05  1978-10-12 14:21:44.568274
7       Joshua  38.0         miami   96000.0     NaN    Roberts       Florida  2006-07-11  1985-10-10 14:21:44.568274
8        Nancy  32.0      new york  129000.0  Female   Ferguson            NY  1992-05-23  1991-10-09 14:21:44.568274
9       Dillon  56.0       chicago   67000.0  Female       Horn            IL  2006-07-11  1967-10-15 14:21:44.568274


In [20]:
df.columns

Index(['Name', 'Age', 'City', 'Salary', 'Gender', 'Surname', 'State',
       'Start Date', 'DateOfBirth'],
      dtype='object')

In [21]:
print(df.tail())

           Name   Age             City   Salary  Gender      Surname  \
155   Jennifer   38.0   san francisco   58000.0     NaN      Palmer    
156       Sean   35.0     Los Angeles   70000.0   Male       Harris    
157      Laura   28.0      Pittsburgh   53000.0   male      Jackson    
158      kiara   54.0           Tampa   33000.0     NaN   Hernandez    
159   Jennifer   38.0   san francisco   58000.0     NaN      Palmer    

              State    Start Date                 DateOfBirth  
155             CA     2017-01-31  1985-10-10 14:21:44.568274  
156             CA     2011-07-06  1988-10-09 14:21:44.568274  
157   Pennsylvania     1991-05-05  1995-10-08 14:21:44.568274  
158        Florida    03-01-1985   1969-10-14 14:21:44.568274  
159             CA     2017-01-31  1985-10-10 14:21:44.568274  


In [22]:
print(df.sample(2))

          Name   Age          City   Salary  Gender     Surname      State  \
6    michelle   45.0   Fort Wayne   61000.0   Male    Anderson    Indiana    
26    shelley   50.0       aurora   60000.0     NaN    Stevens         IL    

    Start Date                 DateOfBirth  
6   1991-05-05  1978-10-12 14:21:44.568274  
26  1990-12-09  1973-10-13 14:21:44.568274  


### Data exploration - shape, describe(), info()


* __shape__ 
    * Returns a tuple: (rows, columns).
    * First element specifies the number of samples/rows.
    * Second element specifies the number of columns.
* __describe()__  
    * Generates basic statistics for each numeric column in the DataFrame.
    * Includes count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum values.
* __info()__  
    * Provides a concise summary of the DataFrame.
    * Includes data types, non-null counts, and memory usage.

In [24]:
# Get a concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         160 non-null    object 
 1   Age          159 non-null    float64
 2   City         156 non-null    object 
 3   Salary       159 non-null    float64
 4   Gender       118 non-null    object 
 5   Surname      158 non-null    object 
 6   State        156 non-null    object 
 7   Start Date   158 non-null    object 
 8   DateOfBirth  160 non-null    object 
dtypes: float64(2), object(7)
memory usage: 11.4+ KB


In [25]:
# Get number of rows and columns
print(df.shape)

(160, 9)


In [26]:
# Get basic statistics for numeric columns
df.describe()

              Age         Salary
count  159.000000     159.000000
mean    40.773585   86503.144654
std     11.125250   40741.835280
min     20.000000   23000.000000
25%     32.000000   54000.000000
50%     41.000000   77000.000000
75%     50.000000  115000.000000
max     59.000000  191000.000000


### Data exploration - nunique, .column_name  

Tools for categorical or discrete data.
  
* __nunique()__
    * Calculates the number of unique values in each column.
    * Useful for understanding data diversity in categorical columns.

* __.column_name__
    * Access a specific column in the DataFrame.
    * Use ['column_name'] when the column name contains spaces or clashes with a DataFrame attribute or method.


In [None]:
# Select the 'City' column using dot notation
df.City


In [None]:
# Select the 'City' column using bracket notation
# Both give the same result
df['City']
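
The notes above describe nunique() without showing it in use. The sketch below is one possible illustration, using a small made-up DataFrame (`staff` and its values are hypothetical). It also shows why bracket notation is needed for a column name that contains a space.

```python
import pandas as pd

staff = pd.DataFrame({
    "City": ["Leeds", "York", "Leeds", "Hull"],
    "Start Date": ["2020-01-01", "2021-06-01", "2020-01-01", "2019-03-15"],
})

# Number of distinct values in every column
print(staff.nunique())

# Number of distinct values in a single column, using dot notation
n_cities = staff.City.nunique()

# Bracket notation is required here because the name contains a space
n_dates = staff["Start Date"].nunique()

print(n_cities, n_dates)
```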

### Basic statistics  

#### For numeric and categorical data:  
* Explore values and their frequencies.  
* Calculate additional statistics for specific columns using mathematical functions.
* Use __mean__ or __median__ to summarise the typical value of a numeric column, and __mode__ to find the most frequent value (this also works for categorical columns).
* Examples:  
    * df.Age.mean()
    * df.Salary.median()
    * df.Gender.mode()

In [9]:
data = {
 "Name": ["Alice", "Bob", "Charlie", "David"],
 "Age": [25, 30, 35, 40],
 "City": ["New York", "San Francisco", "Los Angeles", "Chicago"],
 "Salary": [25000, 27000, 23000, 37500],
 "Gender": ["Female", "Male", "Male", "Male"]
}
df = pd.DataFrame(data)

In [10]:
# Calculate the average value of a column
df.Age.mean()

32.5

In [11]:
# Find the middle value of a column
df.Salary.median()

26000.0

In [12]:
#Determine the most frequent value in a column
df.Gender.mode()

0    Male
Name: Gender, dtype: object
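
For exploring values and their frequencies in a categorical column, one common option is value_counts(). The sketch below is illustrative only. It uses a small stand-alone Series rather than the notebook's DataFrame.

```python
import pandas as pd

gender = pd.Series(["Female", "Male", "Male", "Male"], name="Gender")

# Frequency of each distinct value, most common first
counts = gender.value_counts()
print(counts)

# Proportions instead of raw counts
shares = gender.value_counts(normalize=True)
print(shares)
```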

### TASK 2: Data loading in Pandas  
* Complete the exercises in the __Task 2__ section.  
* OPTIONAL: Open the 'separator.csv' file in a text editor and compare its separators with those in the 'employees.csv' file.

# Data selection

__Data selection__ and __indexing__ are fundamental operations in Pandas.  
They allow extraction of specific subsets of data from a DataFrame.  

* Indexing
    * Selecting particular rows and columns of data from a DataFrame. Also known as subset selection.
    * Selecting Specific Columns and Rows:
        * Use square brackets [], .loc[], and .iloc[] indexing methods.
        * [ ] - Select one or more columns by their names.
        * .loc[ ] - Select rows or columns by label.
        * .iloc[ ] - Select rows and columns by integer location.


#### By name

In [16]:
# Select 'Age' and 'Name' columns
selected_columns = df[['Age', 'Name']]
print(selected_columns)

   Age     Name
0   25    Alice
1   30      Bob
2   35  Charlie
3   40    David


#### By label

__.loc[]__   
* Label-based selection.
* Select rows and columns by label.
* Specify both row and column labels.
    * *selected_data = df.loc[3:6, ['Column1', 'Column2']]*
    * Note: Both the start and end labels are inclusive.


In [18]:
# Select 'Age' and 'City' columns for rows labelled 2 to 6
# Note: In .loc[], both start and end labels are inclusive.
# This DataFrame only has labels up to 3, so only rows 2 and 3 are returned.
selected_data = df.loc[2:6, ['Age', 'City']]
print(selected_data)

   Age         City
2   35  Los Angeles
3   40      Chicago


#### By position

__.iloc[]__  
* Integer-based selection.
* Select rows and columns by integer location.
* Useful for numeric indexing.
    * *selected_data = df.iloc[1:4, 0:2]*  
    * Note: The start position is inclusive and the end position is exclusive, as in standard Python slicing.


In [19]:
# Select the first two columns ('Name' and 'Age' in this DataFrame)
# for rows at positions 1 to 3
# Note: In .iloc[], start index is inclusive but end index is exclusive
selected_data = df.iloc[1:4, 0:2]
print(selected_data)

      Name  Age
1      Bob   30
2  Charlie   35
3    David   40
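
With the default integer index, labels and positions look the same, which can hide the difference between .loc[] and .iloc[]. The sketch below uses a made-up DataFrame with string labels (`df_people` is a hypothetical name) so that the two are easier to tell apart.

```python
import pandas as pd

df_people = pd.DataFrame(
    {"Age": [25, 30, 35, 40],
     "City": ["New York", "San Francisco", "Los Angeles", "Chicago"]},
    index=["alice", "bob", "charlie", "david"],
)

# .loc slices by label and includes both end points
by_label = df_people.loc["bob":"charlie", ["Age"]]

# .iloc slices by position and excludes the end position
by_position = df_people.iloc[1:3, 0:1]

print(by_label)
print(by_position)

# Both selections contain the rows 'bob' and 'charlie' and the 'Age' column
print(by_label.equals(by_position))
```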


#### By filtering and conditionals  

In [None]:
# Filter rows where 'Age' is greater than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)

Boolean indexing  
* Create a boolean mask by applying a condition to a column.  
* Use this mask to filter rows for the True condition.


In [20]:
# Create a boolean mask by applying a condition to a column
# Use this mask to filter rows for the True condition
boolean_mask = df['Age'] > 30
filtered_data = df[boolean_mask]
print(filtered_data)

      Name  Age         City
2  Charlie   35  Los Angeles
3    David   40      Chicago


In [21]:
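# Combine conditions with & (and) or | (or), wrapping each condition in parentheses.
# The DataFrame used in this run has no 'Salary' column, so the lookup raises a KeyError.
boolean_mask = (df['Age'] > 25) & (df['Salary'] > 29000)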
filtered_data = df[boolean_mask]
print(filtered_data)

KeyError: 'Salary'

In [22]:
# Use .isin() method
# filter rows where 'City' is either 'New York' or 'Chicago'
mask = df['City'].isin(['New York', 'Chicago'])
filtered_data = df[mask]

print(filtered_data)

    Name  Age      City
0  Alice   25  New York
3  David   40   Chicago
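
Boolean masks can be combined with `&` (and), `|` (or) and `~` (not). The sketch below is a self-contained illustration. The DataFrame `df_staff` and its thresholds are assumptions chosen for the example, and each condition is wrapped in parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

df_staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Salary": [25000, 27000, 23000, 37500],
})

# AND: both conditions must hold
both = df_staff[(df_staff["Age"] > 25) & (df_staff["Salary"] > 26000)]

# OR: at least one condition must hold
either = df_staff[(df_staff["Age"] > 35) | (df_staff["Salary"] < 24000)]

# NOT: invert a mask
negated = df_staff[~(df_staff["Age"] > 30)]

print(both)
print(either)
print(negated)
```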


#### Data indexing  
Pandas provides various methods to customise a dataframe index:  
* __set_index()__  
    * Set one or more columns as the DataFrame's index.
    * Useful for looking up rows by the values of a specific column.
    * To save changes, assign the result back to the variable, or use __inplace=True__.
* __reset_index()__
    * Resets the index to the default integer index.
    * Optionally removes the existing index.


In [None]:
# Set 'Name' column as index and save changes
df = df.set_index('Name') 
print(df)

In [None]:
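# Alternatively, use inplace=True to modify df directly.
# Run only one of these two cells: once 'Name' is the index it is no longer
# a column, so calling set_index('Name') a second time raises a KeyError.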
df.set_index('Name', inplace=True)
print(df)

In [None]:
# Reset the index to the default integer index.
# NOTE: drop=True discards the existing index altogether, so its data is lost.
# Omit drop=True to keep the old index as a regular column.
#df = df.reset_index(drop=True)
print(df)
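
The effect of set_index() and of the `drop` argument of reset_index() can be seen in a small stand-alone sketch. The DataFrame `df_idx` below is hypothetical and is kept separate from the notebook's `df`.

```python
import pandas as pd

df_idx = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# 'Name' becomes the index, so rows can be looked up by name
indexed = df_idx.set_index("Name")
age_of_bob = indexed.loc["Bob", "Age"]

# Default reset: 'Name' returns as an ordinary column
restored = indexed.reset_index()

# drop=True: the old index ('Name') is discarded entirely
dropped = indexed.reset_index(drop=True)

print(age_of_bob)
print(list(restored.columns))
print(list(dropped.columns))
```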


# Data cleaning

### Removing columns or rows  

__drop()__  
* Remove rows or columns from a dataframe.  
* You can remove a row or column based on its index or label.  
    * Rows: pass the index label(s) to __index=__.  
    * Columns: pass the column name(s) to __columns=__.  

You might want to remove cells or rows/columns from a Pandas DataFrame in cases where data is incorrect, missing or irrelevant for your analysis.  
Removing cells/rows/columns can help you clean and pre-process the data for further analysis.

In [27]:
# Creating a sample dataframe 
df = pd.DataFrame({'A': [1,2,3,4,5],
                   'B': [5,4,3,2,1],
                   'C': [10,20,30,40,50]})
print(df)

   A  B   C
0  1  5  10
1  2  4  20
2  3  3  30
3  4  2  40
4  5  1  50


In [28]:
# removing the 2nd row (index = 1)
df = df.drop(index=1)
print(df)

   A  B   C
0  1  5  10
2  3  3  30
3  4  2  40
4  5  1  50


In [29]:
# Removing column 'B'
df = df.drop(columns='B')
print(df)

   A   C
0  1  10
2  3  30
3  4  40
4  5  50


In [37]:
# Remake the DataFrame after editing the indexes
data = {
 "Name": ["Alice", "Bob", "Charlie", "David", None, "Alice"],
 "Age": [25, 30, 35, None, 40, 25],
 "City": ["New York", "San Francisco", None, "Los Angeles", "Chicago", "New York"],
 "Salary": [25000, 27000, None, 23000, 37500, 25000],
 "Gender": ["Female", "Male", None, "Male", "Male", "Female"]
}
df = pd.DataFrame(data)

print(df)

      Name   Age           City   Salary  Gender
0    Alice  25.0       New York  25000.0  Female
1      Bob  30.0  San Francisco  27000.0    Male
2  Charlie  35.0           None      NaN    None
3    David   NaN    Los Angeles  23000.0    Male
4     None  40.0        Chicago  37500.0    Male
5    Alice  25.0       New York  25000.0  Female


### Correcting date formats



* Incorrect date formats can cause problems when processing and analysing data.
* Suppose a dataframe has a column named ‘Date’ that contains date values in the format ‘dd/mm/yyyy’.
* We need to perform operations on the dates, so we convert the format to ‘yyyy-mm-dd’ (the ISO 8601 standard format for dates).
* Conversion is done using the pandas to_datetime() function.

In [34]:
# Create a sample dataframe 
data = {'Date': ['01/01/2001','02/01/2001','03/01/2001']}
dates_df = pd.DataFrame(data)
print(dates_df)

         Date
0  01/01/2001
1  02/01/2001
2  03/01/2001


In [35]:
# Parse the strings into datetime values
dates_df['Date'] = pd.to_datetime(dates_df['Date'], format='%d/%m/%Y')

# Changing the format of the date column
# Note: strftime() converts the datetimes back to strings
dates_df['Date'] = dates_df['Date'].dt.strftime('%Y-%m-%d')

# Check the results 
print(dates_df)

         Date
0  2001-01-01
1  2001-01-02
2  2001-01-03
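
Real columns often mix several date formats, as the 'Start Date' column in the employees data appears to. One possible approach, sketched below with made-up values, is to parse with `errors='coerce'` so that values that do not match the format become NaT (missing), and then retry those with a second format. The two formats used here are assumptions for the example.

```python
import pandas as pd

start_dates = pd.Series(["1990-12-09", "21-12-2015", "not a date"])

# Parse the expected format; anything that does not match becomes NaT
parsed = pd.to_datetime(start_dates, format="%Y-%m-%d", errors="coerce")

# Retry with a second known format
fallback = pd.to_datetime(start_dates, format="%d-%m-%Y", errors="coerce")

# Keep the first result where it worked, otherwise use the fallback
combined = parsed.fillna(fallback)

print(combined)
```

Values that match neither format stay as NaT and can then be handled like any other missing value.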


### Dealing with missing values  

Missing values occur when no data is stored for certain observations in a variable. 
They can arise due to data entry errors, data collection problems, or automatic conversions or truncation during data processing.  

Why we can’t just leave the missing values:  
* Data Integrity  
    * Missing values can lead to incorrect or biased analysis results.   
    * For example, pandas skips missing values when calculating the mean of a column, so the result describes only the rows that have data and may not represent the whole dataset.  
* Many machine learning models cannot handle them.   
    * Most machine learning algorithms do not support data with missing values.  
    * It is crucial to handle them before feeding the data into an algorithm.  

How we handle them:  
* Data Imputation  
    * Filling missing values, also known as imputation.  
    * This can make the dataset complete and improve the quality of the data.  
    * Methods include using a constant value, mean, median, mode, or using predictive modelling.  
* Removing Data  
    * In some cases, it might be better to remove the observations with missing values, especially if they are a small subset of the data.  
    * However, if they are a larger proportion of the dataset, this could lead to loss of information.  

In Pandas, we can use:
* __isnull()__ or __isna()__ to detect missing values, and  
* __dropna()__ or __fillna()__ to remove or fill missing values.  

By default, __dropna()__ removes the entire row where ANY column value is missing.  
We can control this by specifying __how='all'__ if we want to only remove rows where ALL of the column values are missing.


In [44]:
# Detect missing values in the DataFrame
print(df.isnull())

    Name    Age   City  Salary  Gender
0  False  False  False   False   False
1  False  False  False   False   False
2  False  False   True    True    True
3  False   True  False   False   False
4   True  False  False   False   False
5  False  False  False   False   False


In [39]:
# Count missing values in each column
print(df.isnull().sum())

Name      1
Age       1
City      1
Salary    1
Gender    1
dtype: int64


In [40]:
# Detect existing (non-missing) values in the DataFrame
print(df.notnull())

    Name    Age   City  Salary  Gender
0   True   True   True    True    True
1   True   True   True    True    True
2   True   True  False   False   False
3   True  False   True    True    True
4  False   True   True    True    True
5   True   True   True    True    True


In [41]:
# Remove missing values from the DataFrame
df_no_na = df.dropna()
print(df_no_na)

    Name   Age           City   Salary  Gender
0  Alice  25.0       New York  25000.0  Female
1    Bob  30.0  San Francisco  27000.0    Male
5  Alice  25.0       New York  25000.0  Female


In [42]:
# Fill missing values in the 'Age' column with a specified value (0)
df['Age'] = df['Age'].fillna(0)
print(df)

      Name   Age           City   Salary  Gender
0    Alice  25.0       New York  25000.0  Female
1      Bob  30.0  San Francisco  27000.0    Male
2  Charlie  35.0           None      NaN    None
3    David   0.0    Los Angeles  23000.0    Male
4     None  40.0        Chicago  37500.0    Male
5    Alice  25.0       New York  25000.0  Female


In [43]:
# Fill missing values with a specified value (0)
df_filled = df.fillna(0)
print(df_filled)

      Name   Age           City   Salary  Gender
0    Alice  25.0       New York  25000.0  Female
1      Bob  30.0  San Francisco  27000.0    Male
2  Charlie  35.0              0      0.0       0
3    David   0.0    Los Angeles  23000.0    Male
4        0  40.0        Chicago  37500.0    Male
5    Alice  25.0       New York  25000.0  Female


In [None]:
# Get rows where 'Age' column values are null
missing_data = df[df['Age'].isna()]
print(missing_data)

In [None]:
# Replace null values in 'Age' column with zero
# (assigning the result back is more reliable than fillna(inplace=True) on a single column)
df['Age'] = df['Age'].fillna(value=0)
# Replace null values in 'Salary' column with mean of the column
df['Salary'] = df['Salary'].fillna(value=df['Salary'].mean())
print(df)


In [45]:
# Drop rows where any of 'Age' or 'Name' column value is missing
df.dropna(subset=['Age', 'Name'], how='any', inplace=True)
print(df)

      Name   Age           City   Salary  Gender
0    Alice  25.0       New York  25000.0  Female
1      Bob  30.0  San Francisco  27000.0    Male
2  Charlie  35.0           None      NaN    None
3    David   0.0    Los Angeles  23000.0    Male
5    Alice  25.0       New York  25000.0  Female
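
The difference between the default `how='any'` and `how='all'` in dropna(), and imputation with a statistic instead of a constant, can be compared in a short sketch. The DataFrame `df_gaps` below is made up for illustration.

```python
import pandas as pd

df_gaps = pd.DataFrame({
    "Age": [25.0, None, None],
    "Salary": [25000.0, 27000.0, None],
})

# Default how='any': a row is dropped if any value is missing
any_missing = df_gaps.dropna()

# how='all': a row is dropped only if every value is missing
all_missing = df_gaps.dropna(how="all")

# Impute 'Age' with the column median instead of a constant
filled = df_gaps.assign(Age=df_gaps["Age"].fillna(df_gaps["Age"].median()))

print(any_missing)
print(all_missing)
print(filled)
```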


### Removing Duplicates

Duplicate data can cause problems in data analysis:  
* Inflating the sample size.
* Biasing the results of the analysis.

Pandas provides two methods for handling duplicates:  
* __.duplicated()__ identifies duplicate rows in a DataFrame.
    * __keep='first'__ marks every occurrence except the first as a duplicate (this is the default).
    * __keep='last'__ marks every occurrence except the last as a duplicate.
    * __keep=False__ marks all occurrences as duplicates.
* __.drop_duplicates()__ removes duplicate rows from the DataFrame.


In [46]:
# Get duplicated rows based on all columns
duplicates = df[df.duplicated()]
# Get duplicated rows based on 'Name' and 'Age' columns
duplicates = df[df.duplicated(subset=['Name', 'Age'])]
print(duplicates)

    Name   Age      City   Salary  Gender
5  Alice  25.0  New York  25000.0  Female


In [47]:
# Check count of duplicated rows based on 'Name' and 'Age' columns
print(df.duplicated(subset=['Name', 'Age']).sum())
# Drop duplicated rows and check the resulting shape
print(df.drop_duplicates().shape)
# If satisfied with the resulting shape, save changes
df.drop_duplicates(inplace=True)
print(df)

1
(4, 5)
      Name   Age           City   Salary  Gender
0    Alice  25.0       New York  25000.0  Female
1      Bob  30.0  San Francisco  27000.0    Male
2  Charlie  35.0           None      NaN    None
3    David   0.0    Los Angeles  23000.0    Male


### Converting Data Types

Correct data types are crucial for data analysis. Pandas provides methods to convert data types as needed.  

__.astype()__
* Change the data type of a specific column.  
* Useful for:
    * Converting booleans to int (True to 1, False to 0).  
    * Encoding text categories, by comparing against a value and converting the boolean result:  
        * ‘Female’ and ‘Male’ to 0 and 1.  
        * City name ‘New York’ to 1.  

In [48]:
# Check data types of all columns
print(df.dtypes)
# Change 'Age' column data type to float
df['Age'] = df['Age'].astype('float')
# Convert 'Gender' column values 'Male' to 1 and others (including missing values) to 0
df['Gender'] = (df['Gender'] == 'Male').astype('int')
print(df)

Name       object
Age       float64
City       object
Salary    float64
Gender     object
dtype: object
      Name   Age           City   Salary  Gender
0    Alice  25.0       New York  25000.0       0
1      Bob  30.0  San Francisco  27000.0       1
2  Charlie  35.0           None      NaN       0
3    David   0.0    Los Angeles  23000.0       1


In [49]:
# Check data types of all columns
print(df.dtypes)

# Change 'Salary' column data type to float
df['Salary'] = df['Salary'].astype('float')

# Convert 'City' column values 'New York' to 1 and others (including missing values) to 0
df['City'] = (df['City'] == 'New York').astype('int')

print(df)


Name       object
Age       float64
City       object
Salary    float64
Gender      int32
dtype: object
      Name   Age  City   Salary  Gender
0    Alice  25.0     1  25000.0       0
1      Bob  30.0     0  27000.0       1
2  Charlie  35.0     0      NaN       0
3    David   0.0     0  23000.0       1
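
Comparing against a single value, as in the cells above, turns missing values into 0, which can hide them. Two alternative encodings are sketched below on a made-up Series. These are options to consider, not the notebook's own method: an explicit dictionary with map(), and the integer codes of the pandas 'category' data type.

```python
import pandas as pd

gender = pd.Series(["Female", "Male", "Male", None])

# Explicit mapping: values not in the dictionary (including missing) become NaN
mapped = gender.map({"Female": 0, "Male": 1})

# Categorical codes: one integer per category, with -1 marking missing values
codes = gender.astype("category").cat.codes

print(mapped)
print(codes)
```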


### String Operations

Pandas offers string operations through the __.str__ accessor:  
* __.str__ is available on columns that hold strings (object data type).  
* __.str.lower()__ and __.str.upper()__ to convert strings to lowercase or uppercase.
* __.str.replace()__ replaces substrings within strings.  

In [57]:
# Create a sample DataFrame in which 'Salary' is stored as text
s_data = {
 "Name": ["Alice", "Bob"],
 "Age": [25, 30],
 "City": ["New York", "San Francisco"],
 "Salary": ['$25000', '$27000'],
 "Gender": ["Female", "Male"]
}
s_df = pd.DataFrame(s_data)
print(s_df)



    Name  Age           City  Salary  Gender
0  Alice   25       New York  $25000  Female
1    Bob   30  San Francisco  $27000    Male


In [58]:
# Convert 'Salary' column values from object (string) to float
# regex=False treats '$' as a literal character rather than a regex anchor
s_df['Salary'] = s_df['Salary'].str.replace('$', '', regex=False).astype('float')
print(s_df)

    Name  Age           City   Salary  Gender
0  Alice   25       New York  25000.0  Female
1    Bob   30  San Francisco  27000.0    Male


More string operations:  
* __.str.contains()__  
    * Check if a specific substring or pattern exists within a string.  
    * Returns a boolean Series.  
    * Indicates whether each element contains the specified pattern.  
* __.str.slice()__  
    * Extract a substring from each string in a Series.  
    * Specify the start and end positions to define the slice.  
    * Python slicing is inclusive of start and exclusive of stop.  

In [59]:
# Get rows where 'Name' column contains 'Alice'
contains_pattern = df['Name'].str.contains('Alice')
filtered_data = df[contains_pattern]
print(filtered_data)

    Name   Age  City   Salary  Gender
0  Alice  25.0     1  25000.0       0


In [60]:
# Extracts characters at indices 2, 3, and 4 from each string in 'Name' column
df['Substring'] = df['Name'].str.slice(start=2, stop=5)
print(df)

      Name   Age  City   Salary  Gender Substring
0    Alice  25.0     1  25000.0       0       ice
1      Bob  30.0     0  27000.0       1         b
2  Charlie  35.0     0      NaN       0       arl
3    David   0.0     0  23000.0       1       vid


### TASK 3: Data cleaning in Pandas  
* Complete the exercises in the __Task 3__ section.  
* OPTIONAL: Find your own data and apply some of the data cleaning techniques to a new dataset.
If you don't have any data to hand, try these resources:  
    * https://www.kaggle.com/datasets?fileType=csv  
    * https://datasetsearch.research.google.com/ 
    
    

# Data Manipulation

Data manipulation is a core task in data analysis. It involves transforming and modifying data to derive insights or to prepare it for further analysis.

Pandas:
* provides a rich set of methods for data manipulation.  
* lets you shape your data to meet your specific needs through:  
    * Filtering  
    * Transformation  
    * Aggregation  
    * Sorting  


In [61]:
# Create DataFrame
data = {
 "Name": ["Alice", "Bob", "Charlie", "David"],
 "Age": [25, 30, 35, None],
 "City": [1, 0, 0, 0],
 "Salary": [25000, 27000, 27500, 23000],
 "Gender": [0, 1, 0, 1]
}
df = pd.DataFrame(data)
print(df)

      Name   Age  City  Salary  Gender
0    Alice  25.0     1   25000       0
1      Bob  30.0     0   27000       1
2  Charlie  35.0     0   27500       0
3    David   NaN     0   23000       1


In [62]:
# Filtering
# This line filters rows where 'Age' is greater than 25
filtered_df = df[df['Age'] > 25]
print(filtered_df)

      Name   Age  City  Salary  Gender
1      Bob  30.0     0   27000       1
2  Charlie  35.0     0   27500       0


In [63]:
# Transformation
# This line applies a function to 'Salary' column that increases each salary by 10%
df['Salary'] = df['Salary'].apply(lambda x: x*1.1)
print(df)

      Name   Age  City   Salary  Gender
0    Alice  25.0     1  27500.0       0
1      Bob  30.0     0  29700.0       1
2  Charlie  35.0     0  30250.0       0
3    David   NaN     0  25300.0       1


In [64]:
# Aggregation
# This line computes the mean of 'Salary' column
mean_salary = df['Salary'].mean()
print(mean_salary)

28187.500000000004


In [66]:
# Sorting
# This line sorts DataFrame by 'Age' column in descending order
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)

      Name   Age  City   Salary  Gender
2  Charlie  35.0     0  30250.0       0
1      Bob  30.0     0  29700.0       1
0    Alice  25.0     1  27500.0       0
3    David   NaN     0  25300.0       1


## Applying functions to dataframes    

__Avoid using .iterrows() for element-wise operations: looping over rows in Python is slow.__  
Use vectorised operations whenever possible, because pandas is optimised for them.  

* __apply__  
    * Apply a custom function to a Series, or to the entire DataFrame.
    * Series: each element of the original column is passed to the function.
    * DataFrame: based on the axis (1 - row, 0 – column), an entire row or column is passed to the function.

* __map__  
    * Applies a function to each element of a Series or DataFrame.
        * *applymap() used to be the DataFrame method.*
        * *From pandas 2.1, map() works on both Series and DataFrame, and applymap() is deprecated.*
        * *In older pandas versions, use applymap() for DataFrames.*
    * On a Series, map() also accepts a dictionary, which is useful for substituting each value with a corresponding one.  
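
To make the distinction between vectorised operations and apply() concrete, the sketch below computes the same total in two ways. The DataFrame `df_pay`, its columns and the 26000 threshold are made up for illustration.

```python
import pandas as pd

df_pay = pd.DataFrame({"Salary": [25000, 27000], "Bonus": [1000, 500]})

# Vectorised: operates on whole columns at once (preferred)
df_pay["Total"] = df_pay["Salary"] + df_pay["Bonus"]

# apply on a Series: the function receives one element at a time
df_pay["Band"] = df_pay["Salary"].apply(lambda s: "high" if s > 26000 else "low")

# apply on a DataFrame with axis=1: the function receives one row at a time
df_pay["Total_via_apply"] = df_pay.apply(
    lambda row: row["Salary"] + row["Bonus"], axis=1
)

print(df_pay)
```

Both totals are identical. The vectorised form is shorter and generally faster, so apply() is best kept for logic that cannot be expressed with column operations.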

In [67]:
data = {
 "Name": ["Alice", "Bob", "Charlie", "David", None],
 "Age": [25, 30, 35, None, 40],
 "City": ["New York", "San Francisco", None, "Los Angeles", "Chicago"],
 "Salary": [25000, 27000, None, 23000, 37500],
 "Gender": ["Female", "Male", None, "Male", "Male"]
}
df = pd.DataFrame(data)

print(df)

      Name   Age           City   Salary  Gender
0    Alice  25.0       New York  25000.0  Female
1      Bob  30.0  San Francisco  27000.0    Male
2  Charlie  35.0           None      NaN    None
3    David   NaN    Los Angeles  23000.0    Male
4     None  40.0        Chicago  37500.0    Male


In [None]:
# Define a function to calculate age in months
def age_in_months(x):
    if isinstance(x, (int, float)):
        return x * 12
    return x

# Apply the function to the 'Age' column
df['Age_in_Months'] = df['Age'].apply(age_in_months)

print(df)

In [None]:
# Define a dictionary to map genders to numerical values
mapping_dict = {'Female': 0, 'Male': 1}

# Apply the mapping to the 'Gender' column
df['Gender_Numerical'] = df['Gender'].map(mapping_dict)
print(df)

In [None]:
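# Define a function to replace None values with 'Unknown'
# Note: missing numbers are stored as NaN rather than None, so they are left unchanged
def replace_none(x):
    if x is None:
        return 'Unknown'
    return x

# Apply the function to every element of the DataFrame
# (DataFrame.map() requires pandas 2.1 or later; use applymap() in older versions)
df = df.map(replace_none)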

print(df)


# Data joins 

### Concatenate

Concatenate DataFrames vertically or horizontally.  
* __pd.concat()__  
    * __axis=0__ 
        * Concatenates along the rows (vertically).
        * Adds the rows of the second DataFrame to the end of the first DataFrame.
        * Columns are aligned by name: values in matching columns are stacked in the same column.
    * __axis=1__
        * Concatenates along the columns (horizontally).
        * Adds the columns of the second DataFrame to the right of the first DataFrame.
        * Rows are aligned by index label.
    * Columns (axis=0) or index labels (axis=1) that appear in only one DataFrame are kept, and the missing cells are filled with NaN values.


In [68]:
# Sample DataFrame 1 
data1 = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C':[1,2,3]}
df1 = pd.DataFrame(data1)
print(df1)

   A  B  C
0  1  4  1
1  2  5  2
2  3  6  3


In [69]:
# Sample DataFrame 2: shares columns 'A' and 'B' with DataFrame 1, but has 'D' instead of 'C'
data2 = {'A': [7, 8, 9], 'B': [10, 11, 12], 'D':[1,2,3]}
df2 = pd.DataFrame(data2)
print(df2)

   A   B  D
0  7  10  1
1  8  11  2
2  9  12  3


In [74]:
# Concatenate df1 and df2 vertically (along rows); columns are matched by name
result = pd.concat([df1, df2], axis=0)
result

   A   B    C    D
0  1   4  1.0  NaN
1  2   5  2.0  NaN
2  3   6  3.0  NaN
0  7  10  NaN  1.0
1  8  11  NaN  2.0
2  9  12  NaN  3.0


In [76]:
# Sample DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})

# Concatenate df1 and df2 horizontally (along columns)
result = pd.concat([df1, df2], axis=1)

result

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
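
Row-wise concatenation keeps each DataFrame's original index labels, so labels such as 0, 1, 2 repeat in the result. The sketch below shows two optional `pd.concat` arguments that are often used alongside `axis=0`: `ignore_index=True` to renumber the rows, and `join="inner"` to keep only the columns common to both DataFrames. The names `df_a` and `df_b` are illustrative.

```python
import pandas as pd

df_a = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [1, 2, 3]})
df_b = pd.DataFrame({"A": [7, 8, 9], "B": [10, 11, 12], "D": [1, 2, 3]})

# Renumber the rows 0..5 instead of repeating 0..2
stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)

# Keep only the columns present in both DataFrames ('A' and 'B')
common = pd.concat([df_a, df_b], axis=0, join="inner", ignore_index=True)

print(stacked)
print(common)
```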


### Merge

Merging is the process of combining two or more DataFrames based on a common column or index, similar to SQL joins.

* __.merge()__  
    * SQL-like joins on DataFrames.
    * __how__ - specify the type of join using the how parameter (inner, outer, left, right).
    * __on__ - is used to specify the common column(s).
    * If the common columns have different names in the DataFrames, we use __left_on__ and __right_on__ parameters.  
    * We can join on index instead of columns using __left_index=True__ or __right_index=True__.  
    * Returns a new DataFrame containing the rows selected by the join type. For left, right and outer joins, cells with no matching record are filled with NaN.
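
As a quick sketch of the `how` parameter, the example below runs a left join and a right join on the same pair of DataFrames. The `people` and `cities` DataFrames are illustrative and deliberately contain no missing keys.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})
cities = pd.DataFrame({"Name": ["Alice", "Bob", "Eve"],
                       "City": ["New York", "San Francisco", "London"]})

# Left join: keep every row of 'people'; City is NaN where there is no match
left = pd.merge(people, cities, on="Name", how="left")

# Right join: keep every row of 'cities'; Age is NaN where there is no match
right = pd.merge(people, cities, on="Name", how="right")

print(left)
print(right)
```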


#### Inner join

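* Returns only rows where there is a match in both DataFrames.
* Only rows whose values in the common column(s) match are kept.
* Non-matching rows are not included in the output.
* Note: pandas treats missing keys (None/NaN) as equal to each other, so rows with a missing key in both DataFrames are matched, unlike in SQL.
* Use Case:  
    * When you only want rows with data in both DataFrames.
    * Useful for eliminating rows that have no counterpart in the other DataFrame.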

In [82]:
data1 = {
 "Name": ["Alice", "Bob", "Charlie", "David", None],
 "Age": [25, 30, 35, None, 40]
}
df1 = pd.DataFrame(data1)
df1 

      Name   Age
0    Alice  25.0
1      Bob  30.0
2  Charlie  35.0
3    David   NaN
4     None  40.0


In [83]:
data2 = {
 "Name": ["Alice", "Bob", "Eve", None],
 "City": ["New York", "San Francisco", "London", "Los Angeles"]
}
df2 = pd.DataFrame(data2)
df2

    Name           City
0  Alice       New York
1    Bob  San Francisco
2    Eve         London
3   None    Los Angeles


In [84]:
# Inner join on 'Name'
result = pd.merge(df1, df2, on='Name', how='inner')
print(result)

    Name   Age           City
0  Alice  25.0       New York
1    Bob  30.0  San Francisco
2   None  40.0    Los Angeles
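
The last row of the result shows that pandas matches rows whose key is missing in both DataFrames. If that is not wanted, one possible approach is to drop rows with a missing key before merging. This is a sketch under that assumption; `left` and `right` are illustrative names.

```python
import pandas as pd

left = pd.DataFrame({"Name": ["Alice", None], "Age": [25, 40]})
right = pd.DataFrame({"Name": ["Alice", None], "City": ["New York", "Los Angeles"]})

# Default behaviour: the two rows with a missing Name are matched to each other
with_nulls = pd.merge(left, right, on="Name", how="inner")

# Drop rows with a missing key first so that only real names are matched
without_nulls = pd.merge(
    left.dropna(subset=["Name"]),
    right.dropna(subset=["Name"]),
    on="Name",
    how="inner",
)

print(len(with_nulls), len(without_nulls))
```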


#### Outer join

* Returns all rows from both DataFrames, matching records from both sides where available.  
* Where there is no match, the missing cells are filled with NaN.  
* Use Case:  
    * When you want to keep all rows from both DataFrames.
    * Useful to combine DataFrames without losing any information, even when some records do not have a match in the other DataFrame.

In [85]:
data1 = {
 "Name": ["Alice", "Bob", "Charlie", "David", None],
 "Age": [25, 30, 35, None, 40]
}
df1 = pd.DataFrame(data1)
df1 

      Name   Age
0    Alice  25.0
1      Bob  30.0
2  Charlie  35.0
3    David   NaN
4     None  40.0


In [86]:
data2 = {
 "Name": ["Alice", "Bob", "Eve", None],
 "City": ["New York", "San Francisco", "London", "Los Angeles"]
}
df2 = pd.DataFrame(data2)
df2 

    Name           City
0  Alice       New York
1    Bob  San Francisco
2    Eve         London
3   None    Los Angeles


In [87]:
# Outer join on 'Name'
result = pd.merge(df1, df2, on='Name', how='outer')
print(result)

      Name   Age           City
0    Alice  25.0       New York
1      Bob  30.0  San Francisco
2  Charlie  35.0            NaN
3    David   NaN            NaN
4     None  40.0    Los Angeles
5      Eve   NaN         London
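
When every row is kept, it can help to see where each row came from. The sketch below uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column with the values `left_only`, `right_only` or `both`. The `people` and `cities` DataFrames are illustrative.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})
cities = pd.DataFrame({"Name": ["Alice", "Bob", "Eve"],
                       "City": ["New York", "San Francisco", "London"]})

# indicator=True records the source of each row in a '_merge' column
outer = pd.merge(people, cities, on="Name", how="outer", indicator=True)

print(outer)
```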


#### Joining on different column names 

* When the common columns have different names in the DataFrames, we specify them manually:
    * __left_on=column_name__  
    * __right_on=column_name__ 
* Use Case
    * When you have two DataFrames with a common concept represented in different column names.
        * DataFrame 1: ‘Employee’ column  
        * DataFrame 2: ‘Name’ column  
    * If these columns represent the same concept, i.e., Employee name, use __left_on__ and __right_on__ to join these DataFrames based on this common concept.

In [88]:
data1 = {
 "Name": ["Alice", "Bob", "Charlie", "David", None],
 "Age": [25, 30, 35, None, 40]
}
df1 = pd.DataFrame(data1)
df1 

      Name   Age
0    Alice  25.0
1      Bob  30.0
2  Charlie  35.0
3    David   NaN
4     None  40.0


In [89]:
data2 = {
 "Employee": ["Alice", "Bob", "Eve", None],
 "City": ["New York", "San Francisco", "London", "Los Angeles"]
}
df2 = pd.DataFrame(data2)
df2 

  Employee           City
0    Alice       New York
1      Bob  San Francisco
2      Eve         London
3     None    Los Angeles


In [90]:
# Join on different column names
result = pd.merge(df1, df2, left_on="Name", right_on="Employee")
print(result)

    Name   Age Employee           City
0  Alice  25.0    Alice       New York
1    Bob  30.0      Bob  San Francisco
2   None  40.0     None    Los Angeles
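
Because the key columns have different names, the result keeps both of them, even though they hold the same values. A minimal sketch of one way to tidy this up is to drop the redundant column after merging; `staff` and `offices` are illustrative names.

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})
offices = pd.DataFrame({"Employee": ["Alice", "Bob", "Eve"],
                        "City": ["New York", "San Francisco", "London"]})

# Inner join (the default) on differently named key columns
merged = pd.merge(staff, offices, left_on="Name", right_on="Employee")

# 'Employee' duplicates 'Name' in the result, so remove it
tidy = merged.drop(columns="Employee")

print(tidy)
```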


#### Joining on index

* Match a table column with the index of another table.  
    * __left_index=True__  
    * __right_index=True__ 
* Use Case
    * When one DataFrame's index corresponds to a column in another DataFrame, e.g. the index of DataFrame 1 holds employee IDs and DataFrame 2 has an ‘ID’ column with employee IDs, join the DataFrames on this common concept.
    * Useful when you want to align the data based on the index values instead of aligning them based on matching column values.

In [91]:
data1 = {
 "ID": [1, 2, 3],
 "Name": ["Alice", "Bob", "Charlie"]
}
df1 = pd.DataFrame(data1)
df1 

   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie


In [92]:
data2 = {
 "City": ["New York", "San Francisco", "Los Angeles"],
 "Salary": [25000, 27000, 23000]
}
df2 = pd.DataFrame(data2, index=[2, 3, 4])
df2 

            City  Salary
2       New York   25000
3  San Francisco   27000
4    Los Angeles   23000


In [93]:
# Joining on Index
# Match the left dataframe 'ID' column with right dataframe index
result = pd.merge(df1, df2, left_on="ID", right_index=True)
print(result)

   ID     Name           City  Salary
1   2      Bob       New York   25000
2   3  Charlie  San Francisco   27000


### TASKS 4 and 5: Data manipulation and joins  
* Complete the exercises in the __Task 4__ and __Task 5__ sections.  