
# ***Pandas Basics: A Deep Dive for Beginners and Beyond***


# ***1. Why Use Pandas?***

Imagine you have a messy spreadsheet with data about your friends' social habits. Some cells are empty, some have strange entries, and you want to analyze it to see if there's a pattern between their social media posts and their personality. This is where Pandas comes in.

***Data Manipulation and Analysis***: Pandas is a powerful Python library built for handling and analyzing structured data, like the kind you find in tables (spreadsheets, SQL tables, etc.). It's your go-to tool for cleaning, transforming, and exploring data.

***Ease of Use***: It provides user-friendly data structures and functions that make complex data operations feel intuitive.

# ***2. Pandas vs. NumPy: What's the Difference?***

Think of a toolbox.

# ***NumPy***:
It's like a set of screwdrivers and wrenches. It's excellent for working with raw numbers and mathematical calculations, especially in high-performance scientific computing. Its core data structure is the ndarray (N-dimensional array), which is a grid of values of the same type.

# ***Pandas***:
It's like a complete workbench with a saw, a hammer, a drill, and a toolbox organizer. It's built on top of NumPy and adds a layer of intelligence for tabular data. Its main data structures, Series and DataFrame, have labels for rows and columns, which makes them much more powerful for real-world data analysis.

***Analogy***: NumPy is great for doing the math on a bunch of numbers in a list. Pandas is great for organizing those numbers into a neat table with headings and then doing the math on specific columns.
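
To make the analogy concrete, here is a minimal sketch with made-up numbers (the names `scores`, `table`, `events_attended`, and `friends` are only for illustration). It does the same calculation twice: once on a bare NumPy array, once on a labeled DataFrame.

```python
import numpy as np
import pandas as pd

# NumPy: a plain grid of numbers, addressed by position only
scores = np.array([[8, 12], [2, 3], [9, 10]])
print(scores[:, 0].mean())  # mean of the first column, selected by position

# Pandas: the same numbers, with row and column labels attached
table = pd.DataFrame(
    scores,
    index=['Alice', 'Bob', 'Charlie'],
    columns=['events_attended', 'friends'],
)
print(table['events_attended'].mean())  # same calculation, selected by name
print(table.loc['Bob', 'friends'])      # one cell, looked up by row and column label
```

The numbers are identical in both cases; what Pandas adds is the ability to ask for data by name instead of remembering which position holds what.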

# ***3. The Pandas Series***
A Series is like a single column in a spreadsheet. It's a one-dimensional array-like object that can hold various data types (integers, floats, strings, Python objects, etc.).

***Theory***:

It's a labeled array. The labels are called the index. If you don't provide one, Pandas will create a default integer index (0, 1, 2, ...).

It's a fundamental building block of a DataFrame.

# ***Example 1: Creating a Series***
Let's create a Series for the Social_event_attendance from your dataset.

In [1]:
import pandas as pd

# Creating a Series from a list of values
social_events_attendance = pd.Series([8, 2, 9, 1, 5, 0, 10])

print("Our first Pandas Series:")
print(social_events_attendance)

Our first Pandas Series:
0     8
1     2
2     9
3     1
4     5
5     0
6    10
dtype: int64


***Fun Fact***: The left side (0, 1, 2, ...) is the index, and the right side (8, 2, 9, ...) is the data.

# ***Example 2: Creating a Series with a custom index***
Let's give our data some meaningful labels.

In [4]:
# Creating a Series with a custom index (e.g., person's name)
social_events_attendance_labeled = pd.Series([8, 2, 9, 1, 5], index=['Alice', 'Bob', 'Charlie', 'David', 'Eve'])

print("\nSeries with custom labels:")
print(social_events_attendance_labeled)


Series with custom labels:
Alice      8
Bob        2
Charlie    9
David      1
Eve        5
dtype: int64


# ***Example 3: Accessing elements in a Series***
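You can access elements with square brackets and an index label. With the default integer index (0, 1, 2, ...), each label is also the position, so this looks just like indexing a list.

In [5]:
# Accessing an element by its default integer label (here the same as its position)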
print(f"\nAttendance of the first person: {social_events_attendance[0]}")



Attendance of the first person: 8


In [6]:
# Accessing elements by label
print(f"Attendance of Charlie: {social_events_attendance_labeled['Charlie']}")


Attendance of Charlie: 9


In [7]:
# Accessing multiple elements using a list of labels
print("\nAttendance of Alice and Eve:")
print(social_events_attendance_labeled[['Alice', 'Eve']])


Attendance of Alice and Eve:
Alice    8
Eve      5
dtype: int64
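
Plain square brackets work well for labels, but it can be unclear whether a number means a label or a position. As a small sketch (the Series is rebuilt here so the snippet runs on its own), `.loc` always selects by label and `.iloc` always selects by position:

```python
import pandas as pd

attendance = pd.Series([8, 2, 9, 1, 5],
                       index=['Alice', 'Bob', 'Charlie', 'David', 'Eve'])

# .loc selects by label
print(attendance.loc['Charlie'])        # 9

# .iloc selects by integer position (0-based), whatever the labels are
print(attendance.iloc[0])               # 8 (Alice, the first entry)
print(attendance.iloc[-1])              # 5 (Eve, the last entry)

# Slices: .loc includes the end label, .iloc excludes the end position
print(attendance.loc['Bob':'David'])    # Bob, Charlie, David
print(attendance.iloc[1:3])             # Bob, Charlie
```

Using `.loc` and `.iloc` explicitly makes it obvious to a reader which kind of lookup you intended.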


# ***Example 4: Modifying a Series***
Let's say David's attendance was a mistake and it should be 4.

In [8]:
print("\nBefore modification:")
print(social_events_attendance_labeled)

# Modifying a value using its label
social_events_attendance_labeled['David'] = 4

print("\nAfter modification (David's attendance is now 4):")
print(social_events_attendance_labeled)


Before modification:
Alice      8
Bob        2
Charlie    9
David      1
Eve        5
dtype: int64

After modification (David's attendance is now 4):
Alice      8
Bob        2
Charlie    9
David      4
Eve        5
dtype: int64


# ***Example 5: Series from a dictionary***
A dictionary is a perfect way to create a Series where the keys become the index.

In [9]:
# Creating a Series from a dictionary
friends_circle_data = {'Alice': 12, 'Bob': 3, 'Charlie': 10, 'David': 1, 'Eve': 8}
friends_circle_size = pd.Series(friends_circle_data)

print("\nFriends Circle Size Series:")
print(friends_circle_size)


Friends Circle Size Series:
Alice      12
Bob         3
Charlie    10
David       1
Eve         8
dtype: int64


# ***Example 6: Operations on a Series***
You can perform mathematical operations on an entire Series at once.

In [10]:
# Let's say everyone attended 2 more events.
new_attendance = social_events_attendance_labeled + 2

print("\nAttendance after attending 2 more events:")
print(new_attendance)


Attendance after attending 2 more events:
Alice      10
Bob         4
Charlie    11
David       6
Eve         7
dtype: int64
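
Operations between two Series match values by label rather than by position. The sketch below uses two small made-up Series (the name "Dana" is hypothetical) to show what happens when the labels are in a different order or do not all match:

```python
import pandas as pd

attendance = pd.Series({'Alice': 8, 'Bob': 2, 'Charlie': 9})
friends = pd.Series({'Charlie': 10, 'Alice': 12, 'Dana': 4})  # different order, different people

# Values are matched by label, not by position
total = attendance + friends
print(total)

# Labels present in only one Series produce NaN;
# add() with fill_value treats the missing side as 0 instead
total_filled = attendance.add(friends, fill_value=0)
print(total_filled)
```

Here Bob and Dana each appear in only one Series, so the plain `+` gives NaN for them, while `add(..., fill_value=0)` keeps their single known value.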


***Real-life Analogy***: A Series is like a single column in your "Personality Data" spreadsheet, like the "Time_spent_Alone" column. It has a label at the top (the column name) and a list of values down the column.

# ***4. The Pandas DataFrame***
A DataFrame is the most important data structure in Pandas. It's a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

***Theory:***

Think of it as a spreadsheet or a SQL table.

It's a collection of Series objects, where each Series is a column.

It has both a row index and a column index.

# ***Example 1: Creating a DataFrame***
Let's create a small DataFrame from scratch using a dictionary of lists.

In [11]:
# Creating a DataFrame from a dictionary of lists
data = {
    'Time_spent_Alone': [8, 1, 9, 2],
    'Stage_fear': ['Yes', 'No', 'Yes', 'No'],
    'Personality': ['Introvert', 'Extrovert', 'Introvert', 'Extrovert']
}

# The keys of the dictionary become the column names
small_df = pd.DataFrame(data)

print("Our first DataFrame:")
print(small_df)

Our first DataFrame:
   Time_spent_Alone Stage_fear Personality
0                 8        Yes   Introvert
1                 1         No   Extrovert
2                 9        Yes   Introvert
3                 2         No   Extrovert


# ***Example 2: Indexing and Slicing a DataFrame***
This is how you get specific pieces of data.

***Accessing a single column***: Use bracket notation `df['column_name']`. This returns a Series.

***Accessing multiple columns***: Use a list of column names `df[['col1', 'col2']]`. This returns a DataFrame.

In [13]:
# Accessing a single column (returns a Series)
time_alone_series = small_df['Time_spent_Alone']
print("\n'Time_spent_Alone' column (as a Series):")
print(time_alone_series)



'Time_spent_Alone' column (as a Series):
0    8
1    1
2    9
3    2
Name: Time_spent_Alone, dtype: int64


In [14]:
# Accessing multiple columns (returns a DataFrame)
subset_df = small_df[['Personality', 'Stage_fear']]
print("\n'Personality' and 'Stage_fear' columns (as a DataFrame):")
print(subset_df)


'Personality' and 'Stage_fear' columns (as a DataFrame):
  Personality Stage_fear
0   Introvert        Yes
1   Extrovert         No
2   Introvert        Yes
3   Extrovert         No
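
The bracket notation above selects columns. To pick out rows, or rows and columns together, one common approach is `.iloc` (by position) and `.loc` (by label). A minimal sketch, rebuilding the small DataFrame under the illustrative name `people`:

```python
import pandas as pd

people = pd.DataFrame({
    'Time_spent_Alone': [8, 1, 9, 2],
    'Stage_fear': ['Yes', 'No', 'Yes', 'No'],
    'Personality': ['Introvert', 'Extrovert', 'Introvert', 'Extrovert'],
})

# One row by position: returns a Series whose index is the column names
print(people.iloc[0])

# A slice of rows by position (rows 1 and 2; the end position is excluded)
print(people.iloc[1:3])

# Rows and columns together, by label: rows 0 to 2 inclusive, two columns
print(people.loc[0:2, ['Time_spent_Alone', 'Personality']])

# A single cell
print(people.loc[3, 'Personality'])
```

Note the difference in slicing: `.iloc[1:3]` stops before position 3, while `.loc[0:2]` includes the row labeled 2.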


# ***Example 3: Adding and Removing Columns***
Adding a new column is as easy as assigning a list or Series of the right length to a new column name.

In [15]:
# Adding a new column: 'Drained_after_socializing'
small_df['Drained_after_socializing'] = ['Yes', 'No', 'Yes', 'No']

print("\nDataFrame with a new column:")
print(small_df)



DataFrame with a new column:
   Time_spent_Alone Stage_fear Personality Drained_after_socializing
0                 8        Yes   Introvert                       Yes
1                 1         No   Extrovert                        No
2                 9        Yes   Introvert                       Yes
3                 2         No   Extrovert                        No


In [16]:
# Removing a column
# 'inplace=True' modifies the DataFrame directly, otherwise it returns a new DataFrame
small_df.drop('Stage_fear', axis=1, inplace=True)

print("\nDataFrame after dropping 'Stage_fear' column:")
print(small_df)


DataFrame after dropping 'Stage_fear' column:
   Time_spent_Alone Personality Drained_after_socializing
0                 8   Introvert                       Yes
1                 1   Extrovert                        No
2                 9   Introvert                       Yes
3                 2   Extrovert                        No


In [20]:
# Removing a column without 'inplace=True':
# drop() returns a new DataFrame and leaves the original unchanged
new_df = small_df.drop('Personality', axis=1)

print("\nNew DataFrame after dropping 'Personality' column:")
print(new_df)

print("\nOriginal DataFrame (unchanged, 'Personality' is still there):")
print(small_df)


New DataFrame after dropping 'Personality' column:
   Time_spent_Alone Drained_after_socializing
0                 8                       Yes
1                 1                        No
2                 9                       Yes
3                 2                        No

Original DataFrame (unchanged, 'Personality' is still there):
   Time_spent_Alone Personality Drained_after_socializing
0                 8   Introvert                       Yes
1                 1   Extrovert                        No
2                 9   Introvert                       Yes
3                 2   Extrovert                        No


***Fun Fact***: **axis=1** means "columns," and **axis=0** means "rows."
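
If the axis numbers are hard to remember, `drop()` also accepts the keywords `columns=` and `index=`. The sketch below, on a small made-up DataFrame called `df_demo`, checks that both spellings give the same result:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Time_spent_Alone': [8, 1, 9],
    'Friends_circle_size': [3, 12, 2],
})

# These two calls are equivalent ways to drop a column
by_axis = df_demo.drop('Friends_circle_size', axis=1)
by_keyword = df_demo.drop(columns='Friends_circle_size')
print(by_axis.equals(by_keyword))

# And these two are equivalent ways to drop a row
print(df_demo.drop(0, axis=0).equals(df_demo.drop(index=0)))
```

Which form to use is a matter of taste; the keyword version tends to read more clearly in shared code.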

# ***Example 4: Adding and Removing Rows***
You can add rows using `pd.concat()`. Removing rows is done with `df.drop()`.

In [25]:
# Adding a new row using a dictionary and pd.concat()
new_person = pd.DataFrame([{'Time_spent_Alone': 5, 'Personality': 'Introvert', 'Drained_after_socializing': 'Yes'}])
updated_df = pd.concat([small_df, new_person], ignore_index=True)

print("\nDataFrame after adding a new row:")
print(updated_df)



DataFrame after adding a new row:
   Time_spent_Alone Personality Drained_after_socializing
0                 8   Introvert                       Yes
1                 1   Extrovert                        No
2                 9   Introvert                       Yes
3                 2   Extrovert                        No
4                 5   Introvert                       Yes


In [26]:
# Removing a row by its index
updated_df.drop(0, axis=0, inplace=True) # Drops the first row (index 0)

print("\nDataFrame after dropping the first row:")
print(updated_df)


DataFrame after dropping the first row:
   Time_spent_Alone Personality Drained_after_socializing
1                 1   Extrovert                        No
2                 9   Introvert                       Yes
3                 2   Extrovert                        No
4                 5   Introvert                       Yes
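
Notice that the row labels now start at 1, because dropping a row does not renumber the rest. If you want a fresh 0, 1, 2, ... index afterwards, `reset_index(drop=True)` is one way to get it. A small sketch on a made-up DataFrame:

```python
import pandas as pd

df_demo = pd.DataFrame({'Time_spent_Alone': [8, 1, 9, 2]})
trimmed = df_demo.drop(0, axis=0)
print(trimmed.index.tolist())      # labels 1, 2, 3: the gap left by the dropped row

# drop=True discards the old labels instead of keeping them as a new column
renumbered = trimmed.reset_index(drop=True)
print(renumbered.index.tolist())   # labels 0, 1, 2
```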


# ***Reading Data from Files***

This is the most common way to get data into a DataFrame.

# ***A. Mounting Google Drive (for Google Colab):***
This is standard practice if your data is on your Google Drive.

In [27]:
# Run this cell in Google Colab to mount your Drive
from google.colab import drive
drive.mount('/content/drive')

# Now you can read the file from your Drive
# Make sure to update the path to where your CSV is located!
# For example, if it's in a folder called 'data' in 'My Drive'

Mounted at /content/drive


In [42]:
df = pd.read_csv('/content/drive/MyDrive/Introvert Extrovert Dataset/personality_dataset_1.csv')
df

Unnamed: 0,Time_spent_Alone,Stage_fear,Social_event_attendance,Going_outside,Drained_after_socializing,Friends_circle_size,Post_frequency,Personality
0,4.0,No,4.0,6.0,No,13.0,5.0,Extrovert
1,9.0,Yes,0.0,0.0,Yes,0.0,3.0,Introvert
2,9.0,Yes,1.0,2.0,Yes,5.0,2.0,Introvert
3,0.0,No,6.0,7.0,No,14.0,8.0,Extrovert
4,3.0,No,9.0,4.0,No,8.0,5.0,Extrovert
...,...,...,...,...,...,...,...,...
2895,3.0,No,7.0,6.0,No,6.0,6.0,Extrovert
2896,3.0,No,8.0,3.0,No,14.0,9.0,Extrovert
2897,4.0,Yes,1.0,1.0,Yes,4.0,0.0,Introvert
2898,11.0,Yes,1.0,,Yes,2.0,0.0,Introvert


# ***B. Uploading a File:***
This is useful if you just want to quickly upload a file for a session.

# ***To read a CSV file from an uploaded file in Google Colab, follow these steps***:

***Import the files module from google.colab and the pandas library***:

    from google.colab import files
    import pandas as pd

***Use files.upload() to initiate the upload process:***

    uploaded = files.upload()

This command will open a file browser dialog in your local system, allowing you to select the CSV file you want to upload to your Colab environment.

***Identify the uploaded file's name***:
After the upload is complete, the `uploaded` variable will contain a dictionary where keys are the uploaded filenames and values are the file contents. You can get the filename by accessing the keys of this dictionary.

    for filename in uploaded.keys():
        print(f'Uploaded file: {filename}')
        uploaded_filename = filename # Store the filename for later use

***Read the CSV file into a Pandas DataFrame***:
Use pd.read_csv() and pass the uploaded_filename to it.

    df = pd.read_csv(uploaded_filename)
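
Since the dictionary values hold the file contents, you can also read the data straight from memory. The sketch below does not call Colab at all: the `uploaded` dictionary is a hand-made stand-in, `example.csv` is a hypothetical file name, and it assumes the contents are raw bytes.

```python
import io
import pandas as pd

# Stand-in for the dictionary returned by files.upload():
# keys are filenames, values are the raw file contents as bytes.
uploaded = {'example.csv': b'name,score\nAlice,8\nBob,2\n'}

for filename, contents in uploaded.items():
    # Wrap the bytes in a file-like object so read_csv can parse them directly
    df_uploaded = pd.read_csv(io.BytesIO(contents))
    print(filename, df_uploaded.shape)
```

This avoids depending on the name under which Colab saved the file, which can change when you upload the same file twice.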

In [31]:
# Run this cell in Google Colab to get a file upload widget
from google.colab import files
uploaded = files.upload()

for filename in uploaded.keys():
    print(f'Uploaded file: {filename}')
    uploaded_filename = filename # Store the filename for later use

df_1 = pd.read_csv(uploaded_filename)
df_1

Saving CarPrice_Assignment.csv to CarPrice_Assignment (1).csv
Uploaded file: CarPrice_Assignment (1).csv


Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,201,-1,volvo 145e (sw),gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845.0
201,202,-1,volvo 144ea,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045.0
202,203,-1,volvo 244dl,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485.0
203,204,-1,volvo 246,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470.0


# ***C. Reading from a local file:***
This is for when you are running Python on your own computer.

In [32]:
# '/content/' is the default working folder in Colab; on your own computer,
# use the path to your CSV (just 'colors.csv' if it sits next to your script)
df_2 = pd.read_csv('/content/colors.csv')
df_2


Unnamed: 0,id,name,rgb,is_trans
0,-1,Unknown,0033B2,f
1,0,Black,05131D,f
2,1,Blue,0055BF,f
3,2,Green,237841,f
4,3,Dark Turquoise,008F9B,f
...,...,...,...,...
130,1004,Trans Flame Yellowish Orange,FCB76D,t
131,1005,Trans Fire Yellow,FBE890,t
132,1006,Trans Light Royal Blue,B4D4F7,t
133,1007,Reddish Lilac,8E5597,f


# ***Viewing Data***

These are your best friends for quick data inspection.

In [43]:
# Assuming you have loaded your dataset into a DataFrame called 'df'

# df.head(): Shows the first 5 rows (by default)
print("\nFirst 5 rows of the DataFrame:")
print(df.head())



First 5 rows of the DataFrame:
   Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
0               4.0         No                      4.0            6.0   
1               9.0        Yes                      0.0            0.0   
2               9.0        Yes                      1.0            2.0   
3               0.0         No                      6.0            7.0   
4               3.0         No                      9.0            4.0   

  Drained_after_socializing  Friends_circle_size  Post_frequency Personality  
0                        No                 13.0             5.0   Extrovert  
1                       Yes                  0.0             3.0   Introvert  
2                       Yes                  5.0             2.0   Introvert  
3                        No                 14.0             8.0   Extrovert  
4                        No                  8.0             5.0   Extrovert  


In [44]:
# df.tail(): Shows the last 5 rows (by default)
print("\nLast 5 rows of the DataFrame:")
print(df.tail())



Last 5 rows of the DataFrame:
      Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
2895               3.0         No                      7.0            6.0   
2896               3.0         No                      8.0            3.0   
2897               4.0        Yes                      1.0            1.0   
2898              11.0        Yes                      1.0            NaN   
2899               3.0         No                      6.0            6.0   

     Drained_after_socializing  Friends_circle_size  Post_frequency  \
2895                        No                  6.0             6.0   
2896                        No                 14.0             9.0   
2897                       Yes                  4.0             0.0   
2898                       Yes                  2.0             0.0   
2899                        No                  6.0             9.0   

     Personality  
2895   Extrovert  
2896   Extrovert  
2897   Introvert  
289

In [45]:
# df.info(): Provides a concise summary of the DataFrame
print("\nSummary of the DataFrame's structure:")
df.info()



Summary of the DataFrame's structure:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2900 entries, 0 to 2899
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Time_spent_Alone           2837 non-null   float64
 1   Stage_fear                 2827 non-null   object 
 2   Social_event_attendance    2838 non-null   float64
 3   Going_outside              2834 non-null   float64
 4   Drained_after_socializing  2848 non-null   object 
 5   Friends_circle_size        2823 non-null   float64
 6   Post_frequency             2835 non-null   float64
 7   Personality                2900 non-null   object 
dtypes: float64(5), object(3)
memory usage: 181.4+ KB


In [46]:
# df.describe(): Provides descriptive statistics for numerical columns
print("\nDescriptive statistics for numerical columns:")
print(df.describe())


Descriptive statistics for numerical columns:
       Time_spent_Alone  Social_event_attendance  Going_outside  \
count       2837.000000              2838.000000    2834.000000   
mean           4.505816                 3.963354       3.000000   
std            3.479192                 2.903827       2.247327   
min            0.000000                 0.000000       0.000000   
25%            2.000000                 2.000000       1.000000   
50%            4.000000                 3.000000       3.000000   
75%            8.000000                 6.000000       5.000000   
max           11.000000                10.000000       7.000000   

       Friends_circle_size  Post_frequency  
count          2823.000000     2835.000000  
mean              6.268863        3.564727  
std               4.289693        2.926582  
min               0.000000        0.000000  
25%               3.000000        1.000000  
50%               5.000000        3.000000  
75%              10.000000        

***Real-life Analogy***: A DataFrame is your entire "Personality Data" spreadsheet. Each column is a Series, and it has both row numbers (the index) and column headers (the column names).

# ***Data Cleaning, Grouping, and Merging***
Now that you know the basics, let's get our hands dirty with real-world data problems, like the missing values in your dataset.

# ***1. Handling Missing Values (NaN)***
Missing values are usually shown as NaN (Not a Number) in Pandas; Python's None is treated as missing too.

***Theory***:

`df.isnull()`: This returns a boolean DataFrame of the same size, where True indicates a missing value.

`df.dropna()`: This removes rows or columns with missing values.

`df.fillna(value)`: This fills missing values with a specified value.

# ***Example 1: Finding Missing Values***
Let's use our df, which has missing values in several columns (for example, Going_outside is NaN in row 2898).

In [47]:
print("DataFrame with a missing value:")
print(df)
print("\nChecking for null values:")
print(df.isnull())


DataFrame with a missing value:
      Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
0                  4.0         No                      4.0            6.0   
1                  9.0        Yes                      0.0            0.0   
2                  9.0        Yes                      1.0            2.0   
3                  0.0         No                      6.0            7.0   
4                  3.0         No                      9.0            4.0   
...                ...        ...                      ...            ...   
2895               3.0         No                      7.0            6.0   
2896               3.0         No                      8.0            3.0   
2897               4.0        Yes                      1.0            1.0   
2898              11.0        Yes                      1.0            NaN   
2899               3.0         No                      6.0            6.0   

     Drained_after_socializing  Friends_cir

In [48]:
# You can also count the number of missing values per column
print("\nNumber of missing values per column:")
print(df.isnull().sum())


Number of missing values per column:
Time_spent_Alone             63
Stage_fear                   73
Social_event_attendance      62
Going_outside                66
Drained_after_socializing    52
Friends_circle_size          77
Post_frequency               65
Personality                   0
dtype: int64


# ***Example 2: Dropping Rows with Missing Values***
This is a simple way to clean data, but be careful not to lose too much information.

In [49]:
# Drop all rows that contain at least one NaN value
df_dropped = df.dropna()

print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)
print(f"Original size: {len(df)} rows, New size: {len(df_dropped)} rows")


DataFrame after dropping rows with missing values:
      Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
0                  4.0         No                      4.0            6.0   
1                  9.0        Yes                      0.0            0.0   
2                  9.0        Yes                      1.0            2.0   
3                  0.0         No                      6.0            7.0   
4                  3.0         No                      9.0            4.0   
...                ...        ...                      ...            ...   
2892               9.0        Yes                      2.0            0.0   
2895               3.0         No                      7.0            6.0   
2896               3.0         No                      8.0            3.0   
2897               4.0        Yes                      1.0            1.0   
2899               3.0         No                      6.0            6.0   

     Drained_after_soci
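
Dropping every row with any gap can throw away a lot of data. As a hedged sketch on a small made-up DataFrame (`df_demo`), the `subset` and `thresh` options of `dropna()` let you be more selective:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Time_spent_Alone': [4.0, np.nan, 9.0, np.nan],
    'Stage_fear': ['No', 'Yes', None, None],
    'Personality': ['Extrovert', 'Introvert', 'Introvert', 'Extrovert'],
})

# Default: drop a row if ANY column is missing
print(len(df_demo.dropna()))                              # 1 row left

# Only look at the columns you actually need
print(len(df_demo.dropna(subset=['Time_spent_Alone'])))   # 2 rows left

# Keep rows that have at least 2 non-missing values
print(len(df_demo.dropna(thresh=2)))                      # 3 rows left
```

Which option fits depends on the analysis: if you only need one column, `subset` keeps rows that are incomplete elsewhere.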

# ***Example 3: Filling Missing Values with a Specific Value***
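Let's fill every NaN in the DataFrame with 0. Note that `df.fillna(0)` applies to all columns, so text columns such as Stage_fear also receive 0 wherever they were missing.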

In [50]:
# Fill NaN with a specific value (e.g., 0)
df_filled_zero = df.fillna(0)

print("\nDataFrame after filling NaN with 0:")
print(df_filled_zero)


DataFrame after filling NaN with 0:
      Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
0                  4.0         No                      4.0            6.0   
1                  9.0        Yes                      0.0            0.0   
2                  9.0        Yes                      1.0            2.0   
3                  0.0         No                      6.0            7.0   
4                  3.0         No                      9.0            4.0   
...                ...        ...                      ...            ...   
2895               3.0         No                      7.0            6.0   
2896               3.0         No                      8.0            3.0   
2897               4.0        Yes                      1.0            1.0   
2898              11.0        Yes                      1.0            0.0   
2899               3.0         No                      6.0            6.0   

     Drained_after_socializing  Friend

# ***Example 4: Filling Missing Values with Mean, Median, or Mode***
This is a more intelligent way to impute missing data, especially for numerical columns.

In [51]:
# Calculate the mean of the column (excluding NaN)
mean_time_alone = df['Time_spent_Alone'].mean()
print(f"\nMean of 'Time_spent_Alone': {mean_time_alone:.2f}")

# Fill NaN with the mean
df_filled_mean = df.fillna({'Time_spent_Alone': mean_time_alone})

print("\nDataFrame after filling 'Time_spent_Alone' with the mean:")
print(df_filled_mean)


Mean of 'Time_spent_Alone': 4.51

DataFrame after filling 'Time_spent_Alone' with the mean:
      Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
0                  4.0         No                      4.0            6.0   
1                  9.0        Yes                      0.0            0.0   
2                  9.0        Yes                      1.0            2.0   
3                  0.0         No                      6.0            7.0   
4                  3.0         No                      9.0            4.0   
...                ...        ...                      ...            ...   
2895               3.0         No                      7.0            6.0   
2896               3.0         No                      8.0            3.0   
2897               4.0        Yes                      1.0            1.0   
2898              11.0        Yes                      1.0            NaN   
2899               3.0         No                      6.0  

***Fun Fact***: For skewed data, the median is often a better choice than the mean. For categorical data, the mode (most frequent value) is a good choice.
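
Here is a minimal sketch of median and mode imputation. The DataFrame `df_demo` and its values are made up, with one deliberately large outlier so the gap between mean and median is easy to see.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Friends_circle_size': [2.0, 3.0, np.nan, 4.0, 40.0],   # 40 is an outlier
    'Stage_fear': ['Yes', 'No', 'Yes', None, 'Yes'],
})

# The outlier pulls the mean up; the median is not affected by it
print(df_demo['Friends_circle_size'].mean())     # 12.25
print(df_demo['Friends_circle_size'].median())   # 3.5

# mode() returns a Series (there can be ties), so take the first entry
most_common = df_demo['Stage_fear'].mode().iloc[0]

df_imputed = df_demo.fillna({
    'Friends_circle_size': df_demo['Friends_circle_size'].median(),
    'Stage_fear': most_common,
})
print(df_imputed)
```

Passing a dictionary to `fillna()` lets each column get its own fill value in a single call.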

# ***Sorting and Filtering***

# ***Sorting***:

`df.sort_values(by='column_name')` sorts the DataFrame by the values in a column.

In [53]:
# Sort the DataFrame by 'Time_spent_Alone' in ascending order
df_sorted_asc = df.sort_values(by='Time_spent_Alone')
print("\nDataFrame sorted by 'Time_spent_Alone' (ascending):")
print(df_sorted_asc)



DataFrame sorted by 'Time_spent_Alone' (ascending):
      Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
9                  0.0         No                      8.0            6.0   
3                  0.0         No                      6.0            7.0   
729                0.0         No                      7.0            6.0   
727                0.0         No                      4.0            6.0   
725                0.0         No                      5.0            4.0   
...                ...        ...                      ...            ...   
2705               NaN         No                      9.0            5.0   
2711               NaN         No                      4.0            6.0   
2715               NaN         No                      6.0            6.0   
2772               NaN         No                      7.0            6.0   
2787               NaN        NaN                      1.0            2.0   

     Drained_after_soc

In [54]:
# Sort in descending order
df_sorted_desc = df.sort_values(by='Social_event_attendance', ascending=False)
print("\nDataFrame sorted by 'Social_event_attendance' (descending):")
print(df_sorted_desc.head())


DataFrame sorted by 'Social_event_attendance' (descending):
    Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
30               4.0         No                     10.0            4.0   
72               2.0         No                     10.0            5.0   
97               3.0         No                     10.0            6.0   
98               4.0         No                     10.0            7.0   
57               2.0         No                     10.0            6.0   

   Drained_after_socializing  Friends_circle_size  Post_frequency Personality  
30                        No                  9.0            10.0   Extrovert  
72                        No                 12.0             5.0   Extrovert  
97                        No                 14.0             7.0   Extrovert  
98                        No                  8.0            10.0   Extrovert  
57                        No                 10.0             9.0   Extrovert  
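
`sort_values()` can also sort by more than one column and lets you decide where missing values go. A small sketch on made-up data (`df_demo`):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Personality': ['Introvert', 'Extrovert', 'Introvert', 'Extrovert'],
    'Time_spent_Alone': [9.0, np.nan, 6.0, 2.0],
})

# Sort by Personality (A to Z), then by Time_spent_Alone (largest first) within each group
by_two = df_demo.sort_values(by=['Personality', 'Time_spent_Alone'],
                             ascending=[True, False])
print(by_two)

# Missing values go last by default; na_position='first' moves them to the top
nan_first = df_demo.sort_values(by='Time_spent_Alone', na_position='first')
print(nan_first)
```

Putting NaN first can be handy when you want to inspect the incomplete rows before deciding how to clean them.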


# ***Filtering:***

This is how you select rows that meet a certain condition. The syntax is `df[condition]`.

In [55]:
# Filter for all people who are Introverts
introverts_df = df[df['Personality'] == 'Introvert']
print("\nData for all Introverts:")
print(introverts_df)



Data for all Introverts:
      Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
1                  9.0        Yes                      0.0            0.0   
2                  9.0        Yes                      1.0            2.0   
8                 10.0        Yes                      1.0            3.0   
11                10.0        Yes                      3.0            1.0   
14                 6.0        Yes                      3.0            0.0   
...                ...        ...                      ...            ...   
2891               6.0        Yes                      3.0            1.0   
2892               9.0        Yes                      2.0            0.0   
2893               9.0        NaN                      2.0            0.0   
2897               4.0        Yes                      1.0            1.0   
2898              11.0        Yes                      1.0            NaN   

     Drained_after_socializing  Friends_circle_si

In [56]:
# Filter for people who have stage fear AND are Introverts
introverts_with_fear = df[(df['Personality'] == 'Introvert') & (df['Stage_fear'] == 'Yes')]
print("\nIntroverts with Stage Fear:")
print(introverts_with_fear)


Introverts with Stage Fear:
      Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
1                  9.0        Yes                      0.0            0.0   
2                  9.0        Yes                      1.0            2.0   
8                 10.0        Yes                      1.0            3.0   
11                10.0        Yes                      3.0            1.0   
14                 6.0        Yes                      3.0            0.0   
...                ...        ...                      ...            ...   
2890               8.0        Yes                      2.0            0.0   
2891               6.0        Yes                      3.0            1.0   
2892               9.0        Yes                      2.0            0.0   
2897               4.0        Yes                      1.0            1.0   
2898              11.0        Yes                      1.0            NaN   

     Drained_after_socializing  Friends_circle

***Real-life Analogy***: Filtering is like searching your spreadsheet for all the rows where "Personality" is "Extrovert." Sorting is like clicking the sort button on a column to arrange the data from smallest to largest.
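
As a small illustration of both ideas together, here is a minimal sketch that filters and then sorts. The `people` DataFrame is hypothetical toy data, not the notebook's dataset. Each condition needs its own parentheses, and `|` means OR (just as `&` means AND).

```python
import pandas as pd

# Hypothetical toy data (not the notebook's dataset)
people = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cara', 'Dev'],
    'Personality': ['Introvert', 'Extrovert', 'Introvert', 'Extrovert'],
    'Time_spent_Alone': [9, 2, 6, 1],
})

# Filter: keep rows where either condition holds
mask = (people['Personality'] == 'Extrovert') | (people['Time_spent_Alone'] > 8)
filtered = people[mask]

# Sort: arrange the remaining rows from smallest to largest Time_spent_Alone
sorted_rows = filtered.sort_values('Time_spent_Alone')
print(sorted_rows)
```

Passing `ascending=False` to `sort_values()` reverses the order.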

# ***2. Groupby***
`groupby` is a powerful tool for summarizing data: it computes a statistic separately for each group of rows, such as the average per personality type.

***Theory***:

It's a process of split-apply-combine.

You split the data into groups based on some criterion.

You apply a function (like mean(), sum(), count()) to each group.

You combine the results into a new summary table.
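
The three steps can be written out by hand to see what `groupby` is doing. This is only a sketch on a hypothetical `toy` DataFrame; in practice the one-liner at the end is what you would use.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Personality': ['Introvert', 'Extrovert', 'Introvert', 'Extrovert'],
    'Time_spent_Alone': [8, 1, 9, 2],
})

# Split: iterating over a groupby yields (group label, sub-DataFrame) pairs
# Apply: compute the mean of one column inside each group
manual = {}
for label, group in toy.groupby('Personality'):
    manual[label] = group['Time_spent_Alone'].mean()

# Combine: collect the per-group results into a single Series
manual_result = pd.Series(manual, name='Time_spent_Alone')

# The usual one-liner performs the same three steps
short_result = toy.groupby('Personality')['Time_spent_Alone'].mean()

print(manual_result)
print(short_result)
```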

# ***Example 1: Grouping by Personality***
Let's see the average Time_spent_Alone for each personality type.

In [57]:
# Group the DataFrame by the 'Personality' column and calculate the mean of each group
avg_time_alone_by_personality = df.groupby('Personality')['Time_spent_Alone'].mean()

print("\nAverage Time Spent Alone by Personality Type:")
print(avg_time_alone_by_personality)


Average Time Spent Alone by Personality Type:
Personality
Extrovert    2.067261
Introvert    7.080435
Name: Time_spent_Alone, dtype: float64


# ***Example 2: Grouping and Aggregating Multiple Columns***
You can apply different aggregation functions to multiple columns.

In [58]:
# Group by 'Personality' and find the mean of 'Social_event_attendance'
# and the max of 'Friends_circle_size'
agg_results = df.groupby('Personality').agg({
    'Social_event_attendance': 'mean',
    'Friends_circle_size': 'max'
})

print("\nAggregated results by Personality:")
print(agg_results)


Aggregated results by Personality:
             Social_event_attendance  Friends_circle_size
Personality                                              
Extrovert                   6.016405                 15.0
Introvert                   1.778909                 14.0
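
If you want to choose the names of the result columns, pandas also supports "named aggregation". The sketch below uses a hypothetical `toy` DataFrame whose column names mirror the dataset; the output names `avg_events` and `max_friends` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's dataset
toy = pd.DataFrame({
    'Personality': ['Introvert', 'Extrovert', 'Introvert', 'Extrovert'],
    'Social_event_attendance': [2, 7, 1, 5],
    'Friends_circle_size': [3, 12, 4, 10],
})

# Named aggregation: new_column_name=(source_column, function)
summary = toy.groupby('Personality').agg(
    avg_events=('Social_event_attendance', 'mean'),
    max_friends=('Friends_circle_size', 'max'),
)
print(summary)
```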


# ***Combining DataFrames***

`pd.merge()`: Think of this as a JOIN in SQL. It combines DataFrames based on a common column (a key).

`pd.concat()`: Think of this as stacking DataFrames on top of each other (rows) or side-by-side (columns).

In [59]:
# Let's create two small DataFrames to merge
df_info = pd.DataFrame({'Person_ID': [1, 2, 3, 4],
                        'Name': ['Alex', 'Beth', 'Carl', 'Dana']})

df_scores = pd.DataFrame({'Person_ID': [2, 4, 1, 3],
                          'Social_Score': [85, 92, 78, 65],
                          'Solitude_Score': [15, 5, 22, 30]})

# Merge the two DataFrames on the common column 'Person_ID'
merged_df = pd.merge(df_info, df_scores, on='Person_ID')

print("\nMerged DataFrame based on 'Person_ID':")
print(merged_df)


Merged DataFrame based on 'Person_ID':
   Person_ID  Name  Social_Score  Solitude_Score
0          1  Alex            78              22
1          2  Beth            85              15
2          3  Carl            65              30
3          4  Dana            92               5


***Real-life Analogy***: Merging is like combining two spreadsheets using a common column like "Student ID" to link them together.

# ***pd.merge(): Combining DataFrames Like a Database JOIN***
Think of pd.merge() as joining two tables in a database using a common key or column. This is perfect when you have related data spread across multiple DataFrames.

***Theory***:

***Purpose***: To combine DataFrames based on the values in one or more common columns.

***Key Arguments***:

**left**: The left DataFrame.

**right**: The right DataFrame.

**on**: The column(s) to join on.

**how**: The type of merge to perform (e.g., 'inner', 'outer', 'left', 'right').
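
To make the arguments concrete, here is a minimal sketch that passes all four by keyword and joins on two columns at once. The `visits` and `moods` DataFrames are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data: one row per person per month in each table
visits = pd.DataFrame({
    'Person_ID': [1, 1, 2],
    'Month': ['Jan', 'Feb', 'Jan'],
    'Events': [3, 4, 6],
})
moods = pd.DataFrame({
    'Person_ID': [1, 2, 2],
    'Month': ['Jan', 'Jan', 'Feb'],
    'Mood': ['calm', 'happy', 'tired'],
})

# A row matches only if BOTH Person_ID and Month agree
merged = pd.merge(left=visits, right=moods,
                  on=['Person_ID', 'Month'], how='inner')
print(merged)
```

Only the (1, Jan) and (2, Jan) combinations appear in both tables, so the inner merge keeps two rows.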

# ***Example 1: Inner Merge***
This is like a Venn diagram intersection. It only keeps rows where the joining key exists in both DataFrames.

In [67]:
import pandas as pd

# Let's create two fictional DataFrames: one with personal info and one with social scores
df_personal_info = pd.DataFrame({
    'Person_ID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Personality': ['Introvert', 'Extrovert', 'Introvert', 'Extrovert']
})

df_social_scores = pd.DataFrame({
    'Person_ID': [101, 102, 105, 104],  # Note the missing ID 103 and extra ID 105
    'Social_Score': [75, 92, 60, 88],
    'Friends_circle_size': [5, 14, 2, 11]
})

print("DataFrame 1 (Personal Info):")
print(df_personal_info)
print("\nDataFrame 2 (Social Scores):")
print(df_social_scores)

# Inner merge: Only includes Person_IDs that are in BOTH DataFrames
merged_inner = pd.merge(df_personal_info, df_social_scores, on='Person_ID', how='inner')

print("\n--- Inner Merge Result (Intersection) ---")
print("Keeps only matching IDs (101, 102, 104):")
print(merged_inner)

DataFrame 1 (Personal Info):
   Person_ID     Name Personality
0        101    Alice   Introvert
1        102      Bob   Extrovert
2        103  Charlie   Introvert
3        104    David   Extrovert

DataFrame 2 (Social Scores):
   Person_ID  Social_Score  Friends_circle_size
0        101            75                    5
1        102            92                   14
2        105            60                    2
3        104            88                   11

--- Inner Merge Result (Intersection) ---
Keeps only matching IDs (101, 102, 104):
   Person_ID   Name Personality  Social_Score  Friends_circle_size
0        101  Alice   Introvert            75                    5
1        102    Bob   Extrovert            92                   14
2        104  David   Extrovert            88                   11


***Output Explanation***: Person_ID 103 (Charlie) is dropped because he's not in df_social_scores. Person_ID 105 is dropped because they are not in df_personal_info.

# ***Example 2: Left Merge***
This keeps all rows from the "left" DataFrame and matches them with rows from the "right" DataFrame. If there's no match in the right DataFrame, it fills the columns with NaN.

In [68]:
# Left merge: Keeps all rows from df_personal_info
merged_left = pd.merge(df_personal_info, df_social_scores, on='Person_ID', how='left')

print("\n--- Left Merge Result (Keeps all from left) ---")
print("Keeps all from Personal Info, filling NaN for unmatched rows:")
print(merged_left)


--- Left Merge Result (Keeps all from left) ---
Keeps all from Personal Info, filling NaN for unmatched rows:
   Person_ID     Name Personality  Social_Score  Friends_circle_size
0        101    Alice   Introvert          75.0                  5.0
1        102      Bob   Extrovert          92.0                 14.0
2        103  Charlie   Introvert           NaN                  NaN
3        104    David   Extrovert          88.0                 11.0


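***Output Explanation***: Person_ID 103 (Charlie) is included, but his score columns are filled with NaN because there was no matching data in df_social_scores. Because NaN is a floating-point value, the previously integer score columns are now shown as floats (75.0, 92.0, and so on).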

# ***Example 3: Right Merge***
This is the mirror image of a left merge: it keeps all rows from the "right" DataFrame and fills the left DataFrame's columns with NaN where there is no match.

In [69]:
# Right merge: Keeps all rows from df_social_scores
merged_right = pd.merge(df_personal_info, df_social_scores, on='Person_ID', how='right')

print("\n--- Right Merge Result (Keeps all from right) ---")
print("Keeps all from Social Scores, filling NaN for unmatched rows:")
print(merged_right)


--- Right Merge Result (Keeps all from right) ---
Keeps all from Social Scores, filling NaN for unmatched rows:
   Person_ID   Name Personality  Social_Score  Friends_circle_size
0        101  Alice   Introvert            75                    5
1        102    Bob   Extrovert            92                   14
2        105    NaN         NaN            60                    2
3        104  David   Extrovert            88                   11


***Output Explanation***: Person_ID 105 is included, but their personal info columns are filled with NaN.

# ***Example 4: Outer Merge***
This is like a full Venn diagram union. It keeps all rows from both DataFrames, filling NaN where there are no matches.

In [70]:
# Outer merge: Keeps all rows from both DataFrames
merged_outer = pd.merge(df_personal_info, df_social_scores, on='Person_ID', how='outer')

print("\n--- Outer Merge Result (Union) ---")
print("Keeps all from both DataFrames:")
print(merged_outer)


--- Outer Merge Result (Union) ---
Keeps all from both DataFrames:
   Person_ID     Name Personality  Social_Score  Friends_circle_size
0        101    Alice   Introvert          75.0                  5.0
1        102      Bob   Extrovert          92.0                 14.0
2        103  Charlie   Introvert           NaN                  NaN
3        104    David   Extrovert          88.0                 11.0
4        105      NaN         NaN          60.0                  2.0


***Output Explanation***: All IDs (101, 102, 103, 104, 105) are included. The non-matching columns are filled with NaN.
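
When checking an outer merge, it can help to see which side each row came from. A minimal sketch, assuming small stand-in DataFrames (`info` and `scores`, hypothetical names) shaped like the ones above: the `indicator=True` argument adds a `_merge` column.

```python
import pandas as pd

# Hypothetical stand-ins for the two DataFrames used above
info = pd.DataFrame({
    'Person_ID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
})
scores = pd.DataFrame({
    'Person_ID': [101, 102, 105, 104],
    'Social_Score': [75, 92, 60, 88],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
audit = pd.merge(info, scores, on='Person_ID', how='outer', indicator=True)
print(audit[['Person_ID', '_merge']])

# Map each ID to its source for a quick lookup
source = dict(zip(audit['Person_ID'], audit['_merge'].astype(str)))
```

Here ID 103 is marked `left_only` and ID 105 is marked `right_only`, which matches the explanation above.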

***Real-life Analogy***: You have two spreadsheets: one with employee IDs and names, and another with employee IDs and their salary. You merge them on "employee ID" to create a single spreadsheet with names and salaries.

# ***pd.concat(): Stacking DataFrames***
Think of pd.concat() as a Lego builder. You're either stacking bricks on top of each other (adding rows) or putting them side-by-side (adding columns). It's used for simple concatenation, not for joining on a key.

***Theory***:

***Purpose***: To concatenate Pandas objects along a particular axis (rows or columns).

***Key Arguments***:

***objs***: A list of DataFrames to concatenate.

***axis***: 0 for stacking rows (default), 1 for stacking columns.

***ignore_index***: Set to True to reset the index after concatenation.

# ***Example 1: Concatenating Rows (axis=0)***
This is useful when you have data from the same source split into different files or time periods.

In [72]:
# Part 1 of the dataset
df_q1 = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar'],
    'Time_spent_Alone': [8, 9, 7],
    'Personality': ['Introvert', 'Introvert', 'Introvert']
})

# Part 2 of the dataset
df_q2 = pd.DataFrame({
    'Month': ['Apr', 'May', 'Jun'],
    'Time_spent_Alone': [1, 2, 3],
    'Personality': ['Extrovert', 'Extrovert', 'Extrovert']
})

print("DataFrame Q1:")
print(df_q1)
print("\nDataFrame Q2:")
print(df_q2)

# Concatenate them by stacking rows
# Note: The index will be duplicated (0, 1, 2, 0, 1, 2)
concatenated_df = pd.concat([df_q1, df_q2])

print("\n--- Concatenated DataFrame (Rows) with duplicated index ---")
print(concatenated_df)


DataFrame Q1:
  Month  Time_spent_Alone Personality
0   Jan                 8   Introvert
1   Feb                 9   Introvert
2   Mar                 7   Introvert

DataFrame Q2:
  Month  Time_spent_Alone Personality
0   Apr                 1   Extrovert
1   May                 2   Extrovert
2   Jun                 3   Extrovert

--- Concatenated DataFrame (Rows) with duplicated index ---
  Month  Time_spent_Alone Personality
0   Jan                 8   Introvert
1   Feb                 9   Introvert
2   Mar                 7   Introvert
0   Apr                 1   Extrovert
1   May                 2   Extrovert
2   Jun                 3   Extrovert


In [73]:
# Concatenate with reset index
concatenated_df_reset = pd.concat([df_q1, df_q2], ignore_index=True)

print("\n--- Concatenated DataFrame with reset index ---")
print(concatenated_df_reset)


--- Concatenated DataFrame with reset index ---
  Month  Time_spent_Alone Personality
0   Jan                 8   Introvert
1   Feb                 9   Introvert
2   Mar                 7   Introvert
3   Apr                 1   Extrovert
4   May                 2   Extrovert
5   Jun                 3   Extrovert


# ***Example 2: Concatenating Columns (axis=1)***
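This is useful when two DataFrames describe the same rows and you want to combine them side-by-side. Note that pd.concat() matches rows by index label, not by position, so both DataFrames should share the same index (here the default 0, 1, 2).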

In [74]:
# DataFrame with personal info
df_names = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22]
})

# DataFrame with social info (assuming the rows align)
df_social = pd.DataFrame({
    'Social_Score': [75, 92, 60],
    'Friends_circle_size': [5, 14, 2]
})

print("DataFrame with Names:")
print(df_names)
print("\nDataFrame with Social Info:")
print(df_social)

# Concatenate them side-by-side
concatenated_columns = pd.concat([df_names, df_social], axis=1)

print("\n--- Concatenated DataFrame (Columns) ---")
print(concatenated_columns)

DataFrame with Names:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22

DataFrame with Social Info:
   Social_Score  Friends_circle_size
0            75                    5
1            92                   14
2            60                    2

--- Concatenated DataFrame (Columns) ---
      Name  Age  Social_Score  Friends_circle_size
0    Alice   25            75                    5
1      Bob   30            92                   14
2  Charlie   22            60                    2
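
The rows above line up because both DataFrames share the default index 0, 1, 2. The following sketch, with hypothetical `names` and `scores` DataFrames, shows what happens when the indexes differ, and one possible way to force positional alignment.

```python
import pandas as pd

# Hypothetical frames whose indexes do NOT match
names = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']}, index=[0, 1, 2])
scores = pd.DataFrame({'Social_Score': [92, 60, 75]}, index=[1, 2, 3])

# concat aligns on index labels, so unmatched labels get NaN
side_by_side = pd.concat([names, scores], axis=1)
print(side_by_side)

# Resetting the index first makes the rows line up by position instead
aligned = pd.concat([names, scores.reset_index(drop=True)], axis=1)
print(aligned)
```

In the first result, index 0 has no score and index 3 has no name. Only reset the index when you are sure the rows are in the same order in both DataFrames.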


# ***Example 3: Combining with Mismatched Columns***
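pd.concat() is flexible about column names. With axis=1, as in this example, DataFrames with different columns are simply placed side by side, and because both share the same index (0, 1) no values are missing. NaN appears only where a row or column label exists in one DataFrame but not the other, for example when stacking rows (axis=0) of DataFrames whose columns differ.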

In [75]:
df_a = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df_b = pd.DataFrame({'col3': [5, 6], 'col4': [7, 8]})

# Concatenate with mismatched columns
concatenated_mismatch = pd.concat([df_a, df_b], axis=1)

print("\n--- Concatenated DataFrame with Mismatched Columns ---")
print(concatenated_mismatch)


--- Concatenated DataFrame with Mismatched Columns ---
   col1  col2  col3  col4
0     1     3     5     7
1     2     4     6     8
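
The example above places the columns side by side, so no NaN appears. For contrast, here is a minimal sketch that stacks rows (the default axis=0) of two DataFrames whose columns only partly overlap. The names `part_a` and `part_b` are hypothetical.

```python
import pandas as pd

# Hypothetical frames: only 'col2' exists in both
part_a = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
part_b = pd.DataFrame({'col2': [5, 6], 'col3': [7, 8]})

# Stacking rows: columns missing from one frame are filled with NaN
stacked = pd.concat([part_a, part_b], ignore_index=True)
print(stacked)
```

The rows from `part_a` have NaN in `col3`, and the rows from `part_b` have NaN in `col1`. The shared `col2` is fully populated.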


***Real-life Analogy***: You have monthly sales data for your store, and each month's data is in a separate CSV file (jan_sales.csv, feb_sales.csv, etc.). You use pd.concat() to stack all these monthly DataFrames on top of each other to get a single DataFrame for the whole year.
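
A minimal sketch of that workflow follows. To keep it runnable without files, the monthly data lives in a hypothetical `monthly` dictionary; with real files, each DataFrame would instead come from `pd.read_csv()`.

```python
import pandas as pd

# Hypothetical monthly sales; with real files each frame would come from pd.read_csv()
monthly = {
    'Jan': pd.DataFrame({'Item': ['Pen', 'Book'], 'Sales': [10, 4]}),
    'Feb': pd.DataFrame({'Item': ['Pen', 'Book'], 'Sales': [7, 9]}),
}

frames = []
for month, frame in monthly.items():
    # Record which month each row came from before stacking
    frames.append(frame.assign(Month=month))

# Stack all months and rebuild a clean 0..n-1 index
year = pd.concat(frames, ignore_index=True)
print(year)
```

Collecting the DataFrames in a list and calling `pd.concat()` once is generally preferable to concatenating inside the loop, since each call copies the data.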

# ***3. iloc vs. loc***
This is a key concept for advanced indexing.

***Theory***:

***iloc (integer location)***: Used for integer-based indexing. It works like indexing a list or NumPy array. You use integer positions (e.g., 0, 1, 2).

***loc (label location)***: Used for label-based indexing. You use the actual row and column labels.
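
The difference is easiest to see when the row labels are themselves integers but are not in order. A minimal sketch with a hypothetical `shuffled` DataFrame:

```python
import pandas as pd

# Hypothetical frame whose integer labels are out of order
shuffled = pd.DataFrame({'Name': ['Ann', 'Ben', 'Cara']}, index=[2, 0, 1])

# loc looks up the row whose LABEL is 0 (the second row here)
by_label = shuffled.loc[0, 'Name']

# iloc looks up the row at POSITION 0 (the first row)
by_position = shuffled.iloc[0, 0]

print(by_label)     # Ben
print(by_position)  # Ann
```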

# ***Example 1: Using loc***
Let's use our small_df with a custom index.



In [60]:
small_df_labeled = pd.DataFrame(data, index=['P1', 'P2', 'P3', 'P4'])
small_df_labeled['Drained_after_socializing'] = ['Yes', 'No', 'Yes', 'No']
print("\nDataFrame with custom labels:")
print(small_df_labeled)



DataFrame with custom labels:
    Time_spent_Alone Stage_fear Personality Drained_after_socializing
P1                 8        Yes   Introvert                       Yes
P2                 1         No   Extrovert                        No
P3                 9        Yes   Introvert                       Yes
P4                 2         No   Extrovert                        No


In [61]:
# Access row with label 'P2'
print("\nRow with label 'P2' using .loc:")
print(small_df_labeled.loc['P2'])



Row with label 'P2' using .loc:
Time_spent_Alone                     1
Stage_fear                          No
Personality                  Extrovert
Drained_after_socializing           No
Name: P2, dtype: object


In [62]:
# Access a specific cell using row and column labels
print("\n'Stage_fear' for person 'P3' using .loc:")
print(small_df_labeled.loc['P3', 'Stage_fear'])



'Stage_fear' for person 'P3' using .loc:
Yes


In [63]:
# Access multiple rows and columns using labels
print("\nSelected rows and columns using .loc:")
print(small_df_labeled.loc[['P1', 'P4'], ['Time_spent_Alone', 'Personality']])


Selected rows and columns using .loc:
    Time_spent_Alone Personality
P1                 8   Introvert
P4                 2   Extrovert


# ***Example 2: Using iloc***
Now, let's use integer positions.

In [64]:
# Access the row at integer position 1 (the second row)
print("\nRow at integer position 1 using .iloc:")
print(small_df_labeled.iloc[1])



Row at integer position 1 using .iloc:
Time_spent_Alone                     1
Stage_fear                          No
Personality                  Extrovert
Drained_after_socializing           No
Name: P2, dtype: object


In [65]:
# Access the cell at row position 2, column position 0
print("\nValue at row 2, col 0 using .iloc:")
print(small_df_labeled.iloc[2, 0])



Value at row 2, col 0 using .iloc:
9


In [66]:
# Access rows from position 0 to 2 (exclusive) and columns from 0 to 2 (exclusive)
print("\nRows 0, 1 and columns 0, 1 using slicing with .iloc:")
print(small_df_labeled.iloc[0:2, 0:2])


Rows 0, 1 and columns 0, 1 using slicing with .iloc:
    Time_spent_Alone Stage_fear
P1                 8        Yes
P2                 1         No


***Fun Fact***: iloc is great when you just want to grab the "first 3 rows" of a DataFrame, regardless of their labels. loc is essential when you have meaningful labels (like 'Person_A' or '2023-01-01').
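
One more difference worth knowing: slices with `loc` include the end label, while slices with `iloc` exclude the end position, like ordinary Python slicing. A minimal sketch on a hypothetical `labeled` DataFrame:

```python
import pandas as pd

# Hypothetical frame with string labels
labeled = pd.DataFrame({'Time_spent_Alone': [8, 1, 9, 2]},
                       index=['P1', 'P2', 'P3', 'P4'])

# loc slice: the end label 'P3' IS included
by_label = labeled.loc['P1':'P3']

# iloc slice: the end position 2 is NOT included
by_position = labeled.iloc[0:2]

print(by_label)
print(by_position)
```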