## **Pandas**
* Pandas is a powerful, open-source data analysis and data manipulation library for Python. 

* It provides data structures such as Series (one-dimensional) and DataFrame (two-dimensional), which are flexible and easy to use for handling labeled data. 

* With pandas, you can perform a variety of data operations, including merging, reshaping, cleaning, and aggregating data, making it an essential tool for data scientists and analysts. 

In [None]:
# Installing pandas
! pip install pandas

In [5]:
# Importing pandas
import pandas as pd

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

### Series: A one-dimensional labeled array.

In [6]:
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
series

0    1
1    2
2    3
3    4
4    5
dtype: int64
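The default index above is just 0 to 4, which can hide the "labeled" part of a Series. A minimal sketch with a custom index follows; the labels and values are made up for illustration.

```python
import pandas as pd

# Hypothetical subject labels, chosen only for this illustration
scores = pd.Series([90, 85, 77], index=['math', 'physics', 'art'])

print(scores['physics'])      # look up a value by its label
print(scores.iloc[0])         # look up a value by its position
print(scores.index.tolist())  # the labels themselves
```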

### DataFrame: A two-dimensional labeled data structure, like a table.

In [7]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Ram', None, 'Eva'],
    'Age': [25, 30, None, 34, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Kathmandu', 'Chicago', 'Houston'],
    'Occupation': ['Engineer', 'Artist', 'Doctor', 'Teacher', 'Developer', 'Designer'],
    'Marital Status': ['Single', 'Married', 'Single', 'Married', 'Single', 'Single']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City,Occupation,Marital Status
0,Alice,25.0,New York,Engineer,Single
1,Bob,30.0,San Francisco,Artist,Married
2,Charlie,,Los Angeles,Doctor,Single
3,Ram,34.0,Kathmandu,Teacher,Married
4,,22.0,Chicago,Developer,Single
5,Eva,28.0,Houston,Designer,Single


### Reading and Writing Data
   1. Reading from different file formats
      1. df = pd.read_csv('data.csv')
      2. df = pd.read_json('data.json')
      3. df = pd.read_csv('data.txt', delimiter='\t')
   
   2. Writing data to different file formats
      1. df.to_csv('output.csv', index=False)
      2. df.to_json('output.json', orient='records', lines=True) # other orient values include split, index, columns, and values, but lines=True only works with orient='records'
      3. df.to_csv('output.txt', sep='\t', index=False)

In [8]:
df.to_csv('output.csv', index=False)

In [9]:
df = pd.read_csv('output.csv')
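The same round trip can be sketched for line-delimited JSON. To avoid touching the file system, this sketch keeps the JSON text in memory with `io.StringIO`; the `people` frame is a made-up example, not the notebook's `df`.

```python
from io import StringIO
import pandas as pd

# Small illustrative frame (assumed data)
people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path given, to_json returns the JSON text instead of writing a file
json_lines = people.to_json(orient='records', lines=True)
print(json_lines)

# read_json needs lines=True to parse one JSON object per line
restored = pd.read_json(StringIO(json_lines), lines=True)
print(restored)
```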

### Data Inspection

In [10]:
# Displaying the first few rows:
df.head()  # First 5 rows

Unnamed: 0,Name,Age,City,Occupation,Marital Status
0,Alice,25.0,New York,Engineer,Single
1,Bob,30.0,San Francisco,Artist,Married
2,Charlie,,Los Angeles,Doctor,Single
3,Ram,34.0,Kathmandu,Teacher,Married
4,,22.0,Chicago,Developer,Single


In [11]:
df.tail()  # Last 5 rows

Unnamed: 0,Name,Age,City,Occupation,Marital Status
1,Bob,30.0,San Francisco,Artist,Married
2,Charlie,,Los Angeles,Doctor,Single
3,Ram,34.0,Kathmandu,Teacher,Married
4,,22.0,Chicago,Developer,Single
5,Eva,28.0,Houston,Designer,Single


### Getting basic information about the DataFrame

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Name            5 non-null      object 
 1   Age             5 non-null      float64
 2   City            6 non-null      object 
 3   Occupation      6 non-null      object 
 4   Marital Status  6 non-null      object 
dtypes: float64(1), object(4)
memory usage: 368.0+ bytes


In [13]:
# Viewing column names:
df.columns

Index(['Name', 'Age', 'City', 'Occupation', 'Marital Status'], dtype='object')

In [14]:
df.dtypes

Name               object
Age               float64
City               object
Occupation         object
Marital Status     object
dtype: object

In [15]:
# Summary statistics for the numeric columns of the DataFrame:
df.describe()

Unnamed: 0,Age
count,5.0
mean,27.8
std,4.604346
min,22.0
25%,25.0
50%,28.0
75%,30.0
max,34.0


In [16]:
# Transposing your data:
df.T

Unnamed: 0,0,1,2,3,4,5
Name,Alice,Bob,Charlie,Ram,,Eva
Age,25.0,30.0,,34.0,22.0,28.0
City,New York,San Francisco,Los Angeles,Kathmandu,Chicago,Houston
Occupation,Engineer,Artist,Doctor,Teacher,Developer,Designer
Marital Status,Single,Married,Single,Married,Single,Single


In [17]:
# Sorting by index labels; axis=1 sorts the columns by name:
sorted_column_df = df.sort_index(axis=1, ascending=True)
sorted_column_df

Unnamed: 0,Age,City,Marital Status,Name,Occupation
0,25.0,New York,Single,Alice,Engineer
1,30.0,San Francisco,Married,Bob,Artist
2,,Los Angeles,Single,Charlie,Doctor
3,34.0,Kathmandu,Married,Ram,Teacher
4,22.0,Chicago,Single,,Developer
5,28.0,Houston,Single,Eva,Designer


In [18]:
# sorts by values:
sorted_rows_df = df.sort_values(by='Name', ascending=True)
sorted_rows_df

Unnamed: 0,Name,Age,City,Occupation,Marital Status
0,Alice,25.0,New York,Engineer,Single
1,Bob,30.0,San Francisco,Artist,Married
2,Charlie,,Los Angeles,Doctor,Single
5,Eva,28.0,Houston,Designer,Single
3,Ram,34.0,Kathmandu,Teacher,Married
4,,22.0,Chicago,Developer,Single
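In the output above the row with the missing name ends up last. A short sketch of two `sort_values` options that control this kind of ordering, `na_position` and sorting by several columns, is shown below on a small made-up frame.

```python
import pandas as pd

# Illustrative data (assumed), including one missing name and a tie on Age
people = pd.DataFrame({
    'Name': ['Ram', None, 'Alice', 'Bob'],
    'Age': [34, 22, 25, 25],
})

# Put rows with a missing name first instead of last
names_first_nan = people.sort_values(by='Name', na_position='first')
print(names_first_nan)

# Sort by Age descending, then by Name ascending to break ties
by_age_then_name = people.sort_values(by=['Age', 'Name'], ascending=[False, True])
print(by_age_then_name)
```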


### Selecting Data
* Selecting data is the first step in many common tasks, for example:  
      1. Data Cleaning  
      2. Data Analysis  
      3. Filtering Data  
      4. Aggregation and Grouping, etc  


In [19]:
# selecting single column
ages = df['Age']
print(ages)

0    25.0
1    30.0
2     NaN
3    34.0
4    22.0
5    28.0
Name: Age, dtype: float64


In [20]:
ages = ages + 1
print(ages)
print(df)

0    26.0
1    31.0
2     NaN
3    35.0
4    23.0
5    29.0
Name: Age, dtype: float64
      Name   Age           City Occupation Marital Status
0    Alice  25.0       New York   Engineer         Single
1      Bob  30.0  San Francisco     Artist        Married
2  Charlie   NaN    Los Angeles     Doctor         Single
3      Ram  34.0      Kathmandu    Teacher        Married
4      NaN  22.0        Chicago  Developer         Single
5      Eva  28.0        Houston   Designer         Single


In [21]:
# selecting multiple columns
subset = df[['Name', 'City']]
print(subset)

      Name           City
0    Alice       New York
1      Bob  San Francisco
2  Charlie    Los Angeles
3      Ram      Kathmandu
4      NaN        Chicago
5      Eva        Houston


In [22]:
df2 = df[0:3]
df2

Unnamed: 0,Name,Age,City,Occupation,Marital Status
0,Alice,25.0,New York,Engineer,Single
1,Bob,30.0,San Francisco,Artist,Married
2,Charlie,,Los Angeles,Doctor,Single


* Selection by label using `DataFrame.loc[]` or `DataFrame.at[]`.

In [23]:
# Selecting rows that match a condition (boolean mask):
# Select the rows where the City is 'New York'
selected_row = df[df['City'] == 'New York']
print(selected_row)

    Name   Age      City Occupation Marital Status
0  Alice  25.0  New York   Engineer         Single


In [24]:
selected_row = df.loc[df['City'] == 'New York']
selected_row

Unnamed: 0,Name,Age,City,Occupation,Marital Status
0,Alice,25.0,New York,Engineer,Single


In [25]:
# Selecting all rows (:) and specific column labels:
df.loc[:, ["Name", "Age"]]

Unnamed: 0,Name,Age
0,Alice,25.0
1,Bob,30.0
2,Charlie,
3,Ram,34.0
4,,22.0
5,Eva,28.0


In [26]:
# Define the condition for selecting rows
condition = df['City'].isin(['New York', 'Chicago'])

# Use .loc to select specific rows and columns
selected_data = df.loc[condition, ['Name', 'Age', 'Occupation']]
selected_data

Unnamed: 0,Name,Age,Occupation
0,Alice,25.0,Engineer
4,,22.0,Developer


In [27]:
# .at is optimized for getting and setting a single scalar value
age_value = df.at[0, "Age"]
age_value

25.0

* Selection by position using `DataFrame.iloc[]` or `DataFrame.iat[]`.

In [28]:
# Selecting a row by integer position:
first_row = df.iloc[0]
first_row

Name                 Alice
Age                   25.0
City              New York
Occupation        Engineer
Marital Status      Single
Name: 0, dtype: object

In [29]:
age_value = df.iat[0, 1]  # The second column (index 1) is 'Age'
age_value

25.0
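With the default index, labels and positions happen to coincide, so `.loc`/`.at` and `.iloc`/`.iat` return the same rows. A minimal sketch with a deliberately different index (made-up data) shows where they diverge.

```python
import pandas as pd

cities = pd.DataFrame(
    {'City': ['Kathmandu', 'Chicago', 'Houston']},
    index=[10, 20, 30],  # labels deliberately differ from positions 0, 1, 2
)

print(cities.loc[10, 'City'])  # row with label 10
print(cities.iloc[0, 0])       # row at position 0 (the same row)

# Label slices include the end label; position slices exclude the end position
print(len(cities.loc[10:20]))
print(len(cities.iloc[0:1]))
```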

In [30]:
# Making an independent copy
df3 = df.copy()
df3

Unnamed: 0,Name,Age,City,Occupation,Marital Status
0,Alice,25.0,New York,Engineer,Single
1,Bob,30.0,San Francisco,Artist,Married
2,Charlie,,Los Angeles,Doctor,Single
3,Ram,34.0,Kathmandu,Teacher,Married
4,,22.0,Chicago,Developer,Single
5,Eva,28.0,Houston,Designer,Single
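The reason for calling `.copy()` is easier to see next to plain assignment, which only creates a second name for the same object. A small sketch with made-up values:

```python
import pandas as pd

original = pd.DataFrame({'Age': [25, 30]})
alias = original               # second name for the same object, nothing is copied
independent = original.copy()  # separate object with its own data

alias.loc[0, 'Age'] = 99       # the change is also visible through `original`

print(original)
print(independent)             # still holds the old value
```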


### TASK 1:

1. Filter the DataFrame to only include rows where the Price is greater than $50
2. Filter the resulting DataFrame to only include rows where the Category is either "Electronics" or "Fashion"
3. Print the resulting DataFrame

### Create the sample DataFrame
data = {  
    'Product': ['Laptop', 'Shirt', 'Smartphone', 'Shoes', 'Headphones', 'Watch'],  
    'Price': [999.99, 29.99, 599.99, 89.99, 199.99, 49.99],  
    'Category': ['Electronics', 'Fashion', 'Electronics', 'Fashion', 'Electronics', 'Fashion']  
}

In [31]:
#use dataframe1 as name of dataframe
#dataframe1 = 

In [32]:
# Create a Series
s = pd.Series(['a', 'b', 'c', 'd'])

# Use map to convert all elements to uppercase
result = s.map(str.upper)
result

0    A
1    B
2    C
3    D
dtype: object

Using the reduce function to perform operations on a Series

In [33]:
from functools import reduce

# Create a Series
s = pd.Series([1, 2, 3, 4, 5])

# Convert the Series to a list and use reduce to calculate the sum of all elements
result = reduce(lambda x, y: x + y, s.tolist())
result

15

In [34]:
# Use boolean indexing to keep all elements greater than 3
result = s[s > 3]
print(result)

3    4
4    5
dtype: int64
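Boolean masks like `s > 3` can be combined. The sketch below uses made-up data and the element-wise operators `&` (and), `|` (or), and `~` (not); Python's `and`/`or` keywords do not work on whole Series.

```python
import pandas as pd

# Illustrative data (assumed)
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Ram', 'Eva'],
    'Age': [25, 30, 34, 28],
    'City': ['New York', 'San Francisco', 'Kathmandu', 'Houston'],
})

older_than_26 = people['Age'] > 26
in_selected_cities = people['City'].isin(['Kathmandu', 'Houston'])

both = people[older_than_26 & in_selected_cities]        # both conditions hold
either = people[older_than_26 | in_selected_cities]      # at least one holds
neither = people[~(older_than_26 | in_selected_cities)]  # none holds

print(both)
print(either)
print(neither)
```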


In [35]:
# Create a DataFrame
dataf = pd.DataFrame({
    'Name': ['John', 'Jane', 'Bob'],
    'Age': [25, 35, 20]
})

# Use map to convert all names to uppercase
result = dataf['Name'].map(str.upper)
print(result)

0    JOHN
1    JANE
2     BOB
Name: Name, dtype: object


In [36]:
# Create a DataFrame
dataf = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Use reduce to add the columns element-wise (a row-wise sum)
result = reduce(lambda x, y: x + y, [dataf[col] for col in dataf.columns])
print(result)

0    12
1    15
2    18
dtype: int64
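The `reduce` examples above are useful for seeing the mechanics, but pandas also has built-in reductions that give the same results. A sketch comparing the two on the same values:

```python
from functools import reduce
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
dataf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Sum of a Series: reduce over a list vs. the built-in method
total_reduce = reduce(lambda x, y: x + y, s.tolist())
total_builtin = s.sum()

# Row-wise sum across columns: reduce over columns vs. sum(axis=1)
row_sums_reduce = reduce(lambda x, y: x + y, [dataf[col] for col in dataf.columns])
row_sums_builtin = dataf.sum(axis=1)

print(total_reduce, total_builtin)
print(row_sums_builtin)
```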


### Modifying Data

In [37]:
# Adding a new column:

df['Salary'] =  [70000, 80000, 120000, 50000, None, 75000]
print(df)


      Name   Age           City Occupation Marital Status    Salary
0    Alice  25.0       New York   Engineer         Single   70000.0
1      Bob  30.0  San Francisco     Artist        Married   80000.0
2  Charlie   NaN    Los Angeles     Doctor         Single  120000.0
3      Ram  34.0      Kathmandu    Teacher        Married   50000.0
4      NaN  22.0        Chicago  Developer         Single       NaN
5      Eva  28.0        Houston   Designer         Single   75000.0


In [38]:
# Modifying an existing column:

df['Age'] = df['Age'] + 1
print(df)

      Name   Age           City Occupation Marital Status    Salary
0    Alice  26.0       New York   Engineer         Single   70000.0
1      Bob  31.0  San Francisco     Artist        Married   80000.0
2  Charlie   NaN    Los Angeles     Doctor         Single  120000.0
3      Ram  35.0      Kathmandu    Teacher        Married   50000.0
4      NaN  23.0        Chicago  Developer         Single       NaN
5      Eva  29.0        Houston   Designer         Single   75000.0


In [39]:
# Deleting a column:

df = df.drop(columns=['Salary'])
print(df)

      Name   Age           City Occupation Marital Status
0    Alice  26.0       New York   Engineer         Single
1      Bob  31.0  San Francisco     Artist        Married
2  Charlie   NaN    Los Angeles     Doctor         Single
3      Ram  35.0      Kathmandu    Teacher        Married
4      NaN  23.0        Chicago  Developer         Single
5      Eva  29.0        Houston   Designer         Single


### Handling Missing Data

In [40]:
# Detecting missing data:

print(df.isnull())
print(df.isnull().sum())

    Name    Age   City  Occupation  Marital Status
0  False  False  False       False           False
1  False  False  False       False           False
2  False   True  False       False           False
3  False  False  False       False           False
4   True  False  False       False           False
5  False  False  False       False           False
Name              1
Age               1
City              0
Occupation        0
Marital Status    0
dtype: int64


In [41]:
# Dropping rows with missing data:
print("before dropping rows with missing values")
print(df3)
df3.dropna(inplace=True)
print("after dropping rows with missing values")
print(df3)

before dropping rows with missing values
      Name   Age           City Occupation Marital Status
0    Alice  25.0       New York   Engineer         Single
1      Bob  30.0  San Francisco     Artist        Married
2  Charlie   NaN    Los Angeles     Doctor         Single
3      Ram  34.0      Kathmandu    Teacher        Married
4      NaN  22.0        Chicago  Developer         Single
5      Eva  28.0        Houston   Designer         Single
after dropping rows with missing values
    Name   Age           City Occupation Marital Status
0  Alice  25.0       New York   Engineer         Single
1    Bob  30.0  San Francisco     Artist        Married
3    Ram  34.0      Kathmandu    Teacher        Married
5    Eva  28.0        Houston   Designer         Single


In [42]:
# Filling missing data: fillna returns a new DataFrame, so df5 itself is unchanged here
df5 = df.copy()
df5.fillna(value=5)

Unnamed: 0,Name,Age,City,Occupation,Marital Status
0,Alice,26.0,New York,Engineer,Single
1,Bob,31.0,San Francisco,Artist,Married
2,Charlie,5.0,Los Angeles,Doctor,Single
3,Ram,35.0,Kathmandu,Teacher,Married
4,5,23.0,Chicago,Developer,Single
5,Eva,29.0,Houston,Designer,Single
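Filling every column with the same value puts the number 5 into the `Name` column above. One alternative is to pass a dict to `fillna` so each column gets its own replacement. The sketch below uses made-up data, and the `'Unknown'` placeholder is an assumption, not a pandas default.

```python
import pandas as pd

# Illustrative data (assumed) with one missing name and one missing age
people = pd.DataFrame({
    'Name': ['Alice', None, 'Eva'],
    'Age': [26.0, 23.0, None],
})

# One fill value per column: a placeholder string for Name, the column mean for Age
filled = people.fillna(value={'Name': 'Unknown', 'Age': people['Age'].mean()})
print(filled)
```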


In [43]:
# Drop only the rows where Name is missing
df6 = df.copy()
df6.dropna(subset=['Name'], inplace=True)
df6

Unnamed: 0,Name,Age,City,Occupation,Marital Status
0,Alice,26.0,New York,Engineer,Single
1,Bob,31.0,San Francisco,Artist,Married
2,Charlie,,Los Angeles,Doctor,Single
3,Ram,35.0,Kathmandu,Teacher,Married
5,Eva,29.0,Houston,Designer,Single


### Operations

In [44]:
# Mean of a single column (missing values are skipped)
print(df6['Age'].mean())
df['Age'].mean()

30.25


28.8

In [45]:
print(df['Age'].sum())
print(df['Age'].min())
print(df['Age'].max())
print(df['Age'].median())
print(df['Age'].std())

144.0
23.0
35.0
29.0
4.604345773288535


In [46]:
# Filling missing data: assign the result back instead of calling fillna(inplace=True) on a selected column
df6['Age'] = df6['Age'].fillna(int(df6['Age'].mean()))
print(df6)

      Name   Age           City Occupation Marital Status
0    Alice  26.0       New York   Engineer         Single
1      Bob  31.0  San Francisco     Artist        Married
2  Charlie  30.0    Los Angeles     Doctor         Single
3      Ram  35.0      Kathmandu    Teacher        Married
5      Eva  29.0        Houston   Designer         Single


### Merging and Joining DataFrames
Before combining DataFrames, their column names must line up. Use `.rename()` to fix mismatched names.

In [47]:
df7 = df6.copy()
# rename returns a new DataFrame; df7 itself keeps its original column names
df7.rename(columns={'City': 'Location', 'Age':'Ages'})

Unnamed: 0,Name,Ages,Location,Occupation,Marital Status
0,Alice,26.0,New York,Engineer,Single
1,Bob,31.0,San Francisco,Artist,Married
2,Charlie,30.0,Los Angeles,Doctor,Single
3,Ram,35.0,Kathmandu,Teacher,Married
5,Eva,29.0,Houston,Designer,Single
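Because `rename` returns a new DataFrame, the result has to be assigned to a name to be kept. A minimal sketch on a made-up one-row frame:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice'], 'City': ['New York']})

people.rename(columns={'City': 'Location'})  # result is discarded
print(people.columns.tolist())               # original names are still there

renamed = people.rename(columns={'City': 'Location'})  # keep the result
print(renamed.columns.tolist())
```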


In [48]:
# Concatenation is one way to combine DataFrames:

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})


merged_df2 = pd.concat([df1,df2])

print(merged_df2) 

  key  value1  value2
0   A     1.0     NaN
1   B     2.0     NaN
2   C     3.0     NaN
0   A     NaN     4.0
1   B     NaN     5.0
2   D     NaN     6.0
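Note the repeated index labels (0, 1, 2, 0, 1, 2) in the output above. The sketch below, on two small made-up frames, shows `ignore_index=True` for a fresh index and `axis=1` for placing frames side by side.

```python
import pandas as pd

# Illustrative frames (assumed) with identical columns
df_a = pd.DataFrame({'key': ['A', 'B'], 'value1': [1, 2]})
df_b = pd.DataFrame({'key': ['C', 'D'], 'value1': [3, 4]})

# Stack rows and renumber the index 0..n-1
stacked = pd.concat([df_a, df_b], ignore_index=True)
print(stacked)

# Place the frames next to each other, aligned on the row index
side_by_side = pd.concat([df_a, df_b], axis=1)
print(side_by_side)
```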


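There are several types of joins for combining datasets based on common columns or indices. These include:
    1. Inner Join: keeps only the rows whose key appears in both DataFrames.
    2. Left Join: keeps every row of the left DataFrame and fills unmatched right-hand columns with NaN.
    3. Right Join: keeps every row of the right DataFrame and fills unmatched left-hand columns with NaN.
    4. Cross Join: pairs every row of the left DataFrame with every row of the right one (Cartesian product).

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-5.png](attachment:image-5.png)
![image-4.png](attachment:image-4.png)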
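The join types can be compared side by side through the `how` argument of `pd.merge`. This sketch uses two small frames named `left` and `right` (names chosen for illustration) and also includes `how='outer'`, which pandas supports in addition to the joins listed above.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})

# Collect the keys that survive each kind of join
keys_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(left, right, on='key', how=how)
    keys_by_how[how] = result['key'].tolist()
    print(how, keys_by_how[how])

# A cross join takes no `on` column: every left row is paired with every right row
cross = pd.merge(left, right, how='cross')
print('cross', len(cross))
```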

In [49]:
# Joining DataFrames:
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})

# merge defaults to how='inner' and, without `on`, joins on all shared columns (here only 'key')
merged_df = pd.merge(df1, df2)
print(merged_df) 

merged_df = pd.merge(df1, df2, on='key')
print(merged_df)

  key  value1  value2
0   A       1       4
1   B       2       5
  key  value1  value2
0   A       1       4
1   B       2       5


In [50]:
df3 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3], 'value2': [4, 5, 6]})
df4 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value1': [7, 8, 9], 'value2': [10, 11, 12]})

merged_df = pd.merge(df3, df4)
print(merged_df) # with no `on`, merge joins on all shared columns (key, value1, value2), so no rows match


Empty DataFrame
Columns: [key, value1, value2]
Index: []


In [51]:
merged_df = pd.merge(df3, df4, on='key')
print(merged_df)

  key  value1_x  value2_x  value1_y  value2_y
0   A         1         4         7        10
1   B         2         5         8        11


In [52]:
merged_df = pd.merge(df3, df4, on='key', how='left')
print(merged_df)

  key  value1_x  value2_x  value1_y  value2_y
0   A         1         4       7.0      10.0
1   B         2         5       8.0      11.0
2   C         3         6       NaN       NaN
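The `_x` and `_y` suffixes above are the defaults pandas adds when both frames share a non-key column name. A sketch of two related `pd.merge` parameters, `suffixes` and `indicator`, on made-up frames:

```python
import pandas as pd

# Illustrative frames (assumed) that share the non-key column 'value1'
df_left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df_right = pd.DataFrame({'key': ['A', 'B', 'D'], 'value1': [7, 8, 9]})

merged = pd.merge(
    df_left, df_right, on='key', how='left',
    suffixes=('_left', '_right'),  # replace the default _x / _y suffixes
    indicator=True,                # add a _merge column showing where each row came from
)
print(merged)
```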


In [53]:
# pd.merge defaults to how='inner'
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)
# DataFrame.join defaults to how='left', so set 'key' as the index of both frames and request an inner join explicitly
joined_df = df1.set_index('key').join(df2.set_index('key'), how='inner')
joined_df

  key  value1  value2
0   A       1       4
1   B       2       5


Unnamed: 0_level_0,value1,value2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,1,4
B,2,5


### TASK 2:

1. Merge the two DataFrames on the CustomerID column
2. Print the resulting DataFrame
   
### Create the first sample DataFrame (dataframe2)
data1 = {
    'CustomerID': [1, 2, 3, 4],
    'Name': ['John Doe', 'Jane Smith', 'Alice Johnson', 'Bob Brown'],
    'Address': ['123 Elm St', '456 Oak St', '789 Pine St', '101 Maple St']
}

### Create the second sample DataFrame (dataframe3)
data2 = {
    'CustomerID': [1, 2, 1, 3],  
    'OrderDate': ['2024-01-01', '2024-01-05', '2024-01-10', '2024-01-15'],  
    'TotalAmount': [100.50, 200.75, 150.25, 300.00]  
}

In [54]:
#use dataframe2 and dataframe3 as name of dataframe
#dataframe2 =
#dataframe3 = 

### Grouping and Aggregating Data
* This allows you to analyze subsets of data independently. 

* By “group by” we are referring to a process involving one or more of the following steps:
    * Splitting the data into groups based on some criteria
    * Applying a function to each group independently
    * Combining the results into a data structure
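The three steps above can be sketched by hand and compared with the one-line `groupby` call. The data is made up, and the manual version is only meant to make the steps visible, not as a recommended way to aggregate.

```python
import pandas as pd

# Illustrative data (assumed)
people = pd.DataFrame({
    'Marital Status': ['Single', 'Married', 'Single', 'Married'],
    'Age': [26, 31, 30, 35],
})

# Split: one sub-DataFrame per group
groups = {status: part for status, part in people.groupby('Marital Status')}

# Apply: compute the mean age of each group independently
means = {status: part['Age'].mean() for status, part in groups.items()}

# Combine: collect the per-group results into a single Series
manual = pd.Series(means, name='Age')
print(manual)

# The same result in one step
shortcut = people.groupby('Marital Status')['Age'].mean()
print(shortcut)
```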

In [55]:
# Grouping by a column
grouped = df6.groupby('Age')
print(grouped)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f643544ea30>


In [56]:
print(df6)
group = df6.groupby('Marital Status')['Age'].mean()
print(group)

      Name   Age           City Occupation Marital Status
0    Alice  26.0       New York   Engineer         Single
1      Bob  31.0  San Francisco     Artist        Married
2  Charlie  30.0    Los Angeles     Doctor         Single
3      Ram  35.0      Kathmandu    Teacher        Married
5      Eva  29.0        Houston   Designer         Single
Marital Status
Married    33.000000
Single     28.333333
Name: Age, dtype: float64


Aggregation is the process of computing a summary statistic (or statistics) for each group. Common aggregation operations include sum, mean, median, min, max, count, etc. You can use the agg() method to perform these operations on the grouped data.

syntax: grouped.agg({'column_name': 'aggregation_function'})


In [57]:
# Grouping by 'Marital Status' and aggregating to find the mean age
grouped = df6.groupby('Marital Status').agg(average_age = ('Age', 'mean'))
grouped

Unnamed: 0_level_0,average_age
Marital Status,Unnamed: 1_level_1
Married,33.0
Single,28.333333


In [58]:
grouped = df6.groupby('Marital Status').agg({'Age':'mean'})
grouped

Unnamed: 0_level_0,Age
Marital Status,Unnamed: 1_level_1
Married,33.0
Single,28.333333


In [59]:
# Grouping by 'City' and applying multiple aggregations
aggregated = df6.groupby('City').agg({
    'Age': ['mean', 'max'],
    'Name': 'count'
}).reset_index()

# Renaming the columns for clarity
#aggregated.columns = ['City', 'Mean Age', 'Max Age', 'Count']

print(aggregated)


            City   Age        Name
                  mean   max count
0        Houston  29.0  29.0     1
1      Kathmandu  35.0  35.0     1
2    Los Angeles  30.0  30.0     1
3       New York  26.0  26.0     1
4  San Francisco  31.0  31.0     1
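Passing a dict with a list of functions produces the two-level column header seen above. One way to get flat, readable column names directly is named aggregation. The sketch below uses made-up data, and the output column names (`mean_age`, `max_age`, `count`) are arbitrary choices.

```python
import pandas as pd

# Illustrative data (assumed); 'Sam' is a hypothetical extra row so one city has two people
people = pd.DataFrame({
    'City': ['Houston', 'Houston', 'Kathmandu'],
    'Name': ['Eva', 'Sam', 'Ram'],
    'Age': [29, 31, 35],
})

# Each keyword becomes an output column: name=(source column, aggregation)
summary = people.groupby('City').agg(
    mean_age=('Age', 'mean'),
    max_age=('Age', 'max'),
    count=('Name', 'count'),
).reset_index()

print(summary)
```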


### TASK 3:
1. Group the DataFrame by the Region column
2. Calculate the total Sales for each region
3. Calculate the average Sales for each region
4. Print the resulting DataFrame

### Create a sample DataFrame
data = {  
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],  
    'Sales': [150, 200, 250, 300, 350, 400, 450, 500],  
    'Quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q2']  
}