### **Assignment: Getting Started with Pandas**

### **Total Points: 100**

In [2]:
import pandas as pd



### **Assignment 1: Creating and Manipulating DataFrames** (20 points)
**Objective**: Understand the basics of creating and manipulating pandas DataFrames.


1. **Task**:
   - Create a pandas DataFrame from a dictionary where the keys represent column names and the values are lists of data. For example:
     ```python
     data = {
         'Name': ['John', 'Jane', 'Dave', 'Anna'],
         'Age': [23, 25, 22, 29],
         'Score': [88, 92, 85, 90]
     }
     ```
   - Perform the following operations:
     - Add a new column `Passed` where the value is `True` if the `Score` is greater than 85, otherwise `False`.
     - Sort the DataFrame by the `Age` column in descending order.

In [3]:
data = {
    'Name': ['John', 'Jane', 'Dave', 'Anna'],
    'Age': [23, 25, 22, 29],
    'Score': [88, 92, 85, 90]
}

# Step 1: Build the DataFrame from the dictionary (keys become column names)
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Step 2: Add a new column 'Passed' based on the 'Score' column
# If 'Score' > 85, 'Passed' is True; otherwise, it is False
df['Passed'] = df['Score'] > 85
print("\nDataFrame with 'Passed' Column:\n", df)

# Step 3: Sort the DataFrame by the 'Age' column in descending order
df = df.sort_values(by='Age', ascending=False)
print("\nDataFrame Sorted by Age (Descending):\n", df)

Original DataFrame:
    Name  Age  Score
0  John   23     88
1  Jane   25     92
2  Dave   22     85
3  Anna   29     90

DataFrame with 'Passed' Column:
    Name  Age  Score  Passed
0  John   23     88    True
1  Jane   25     92    True
2  Dave   22     85   False
3  Anna   29     90    True

DataFrame Sorted by Age (Descending):
    Name  Age  Score  Passed
3  Anna   29     90    True
1  Jane   25     92    True
0  John   23     88    True
2  Dave   22     85   False
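
The same three steps can also be written as one chained expression. The sketch below is only an illustration; the name `df_chained` and the final `reset_index(drop=True)` call are assumptions added here, not part of the assignment. Note that `sort_values` keeps the original row labels (3, 1, 0, 2 in the output above), so resetting the index is optional.

```python
import pandas as pd

data = {
    'Name': ['John', 'Jane', 'Dave', 'Anna'],
    'Age': [23, 25, 22, 29],
    'Score': [88, 92, 85, 90]
}

# Build the frame, add the flag column, and sort in a single chain
df_chained = (
    pd.DataFrame(data)
    .assign(Passed=lambda d: d['Score'] > 85)   # True when Score is strictly above 85
    .sort_values(by='Age', ascending=False)     # oldest first
    .reset_index(drop=True)                     # renumber rows 0..3 after sorting
)
print(df_chained)
```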


2. **Bonus**: Filter the DataFrame to display only the rows where the `Score` is above 90.

In [4]:
# Boolean mask: keep only rows whose Score is strictly greater than 90
high_score_df = df[df['Score'] > 90]
print("\nRows Where Score is Above 90:\n", high_score_df)


Rows Where Score is Above 90:
    Name  Age  Score  Passed
1  Jane   25     92    True
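
Only Jane appears because the comparison is strict: Anna's score of exactly 90 is not "above 90". A minimal sketch of two other ways to express the filter, assuming the same data as above; the variable names `strictly_above` and `at_least_90` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Dave', 'Anna'],
    'Age': [23, 25, 22, 29],
    'Score': [88, 92, 85, 90]
})

# DataFrame.query takes the condition as a string expression
strictly_above = df.query('Score > 90')

# .loc filters rows and selects columns in one step; >= also includes a score of exactly 90
at_least_90 = df.loc[df['Score'] >= 90, ['Name', 'Score']]

print(strictly_above)
print(at_least_90)
```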


### **Assignment 2: Indexing, Selection, and Filtering** (20 points)
**Objective**: Learn how to access and filter data in DataFrames using different indexing techniques.

1. **Task**:
   - Create a DataFrame containing students and their scores across different subjects:
     ```python
     data = {
         'Student': ['John', 'Jane', 'Dave', 'Anna'],
         'Math': [80, 95, 85, 92],
         'English': [78, 89, 94, 88],
         'History': [85, 90, 88, 92]
     }
     ```
   - Select the `Math` and `English` scores for all students.
   - Filter the DataFrame to show only the students who scored more than 90 in `Math` or `English`.


In [None]:
data = {
    'Student': ['John', 'Jane', 'Dave', 'Anna'],
    'Math': [80, 95, 85, 92],
    'English': [78, 89, 94, 88],
    'History': [85, 90, 88, 92]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

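# Select the Math and English columns (Student is kept so each row stays identifiable)
math_english_df = df[['Student', 'Math', 'English']]
print("\nDataFrame with Math and English Scores:\n", math_english_df)

# Combine the two conditions with | (element-wise OR); each condition needs its own parentheses
high_scorers_df = math_english_df[(math_english_df['Math'] > 90) | (math_english_df['English'] > 90)]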
print("\nStudents Who Scored More Than 90 in Math or English:\n", high_scorers_df)
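
The "Math or English" condition can also be built from a list of subject columns, which scales better when more subjects are added. This is a sketch under that assumption; the names `subjects`, `mask`, and `high_scorers` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Student': ['John', 'Jane', 'Dave', 'Anna'],
    'Math': [80, 95, 85, 92],
    'English': [78, 89, 94, 88],
    'History': [85, 90, 88, 92]
})

subjects = ['Math', 'English']

# Compare every subject column at once, then ask whether any of them is True per row
mask = (df[subjects] > 90).any(axis=1)

# .loc applies the row mask and the column selection together
high_scorers = df.loc[mask, ['Student'] + subjects]
print(high_scorers)
```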

### **Assignment 3: Handling Missing Data** (15 points)
**Objective**: Work with missing data in pandas.

1. **Task**:
   - Create a DataFrame with missing values:
     ```python
     data = {
         'Name': ['John', 'Jane', 'Dave', 'Anna'],
         'Age': [23, 25, np.nan, 29],
         'Score': [88, np.nan, 85, 90]
     }
     ```
   - Perform the following operations:
     - Fill the missing `Age` with the mean age.
     - Drop any rows where the `Score` is missing.
     - Replace any remaining missing values in the DataFrame with `0`.

In [6]:
import numpy as np

In [7]:
data = {
    'Name': ['John', 'Jane', 'Dave', 'Anna'],
    'Age': [23, 25, np.nan, 29],
    'Score': [88, np.nan, 85, 90]
}

df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:\n", df)

# Step 2: Fill the missing 'Age' with the mean age
mean_age = df['Age'].mean()  # Calculate the mean age
df['Age'] = df['Age'].fillna(mean_age)  # Fill missing values in 'Age' with the mean
print("\nDataFrame after Filling Missing Age with Mean:\n", df)

# Step 3: Drop any rows where the 'Score' is missing
df = df.dropna(subset=['Score'])  # Drop rows with missing 'Score'
print("\nDataFrame after Dropping Rows with Missing Score:\n", df)

# Step 4: Replace any remaining missing values in the DataFrame with 0
df = df.fillna(0)  # Replace any remaining NaNs with 0
print("\nDataFrame after Replacing Remaining Missing Values with 0:\n", df)

Original DataFrame with Missing Values:
    Name   Age  Score
0  John  23.0   88.0
1  Jane  25.0    NaN
2  Dave   NaN   85.0
3  Anna  29.0   90.0

DataFrame after Filling Missing Age with Mean:
    Name        Age  Score
0  John  23.000000   88.0
1  Jane  25.000000    NaN
2  Dave  25.666667   85.0
3  Anna  29.000000   90.0

DataFrame after Dropping Rows with Missing Score:
    Name        Age  Score
0  John  23.000000   88.0
2  Dave  25.666667   85.0
3  Anna  29.000000   90.0

DataFrame after Replacing Remaining Missing Values with 0:
    Name        Age  Score
0  John  23.000000   88.0
2  Dave  25.666667   85.0
3  Anna  29.000000   90.0


### **Assignment 4: Grouping and Aggregating Data** (20 points)
**Objective**: Use pandas to group and aggregate data.

1. **Task**:
   - Create a DataFrame containing sales data for a store:
     ```python
     data = {
         'Store': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
         'Sales': [150, 200, 100, 250, 300, 120, 180],
         'Week': [1, 1, 2, 2, 3, 3, 3]
     }
     ```
   - Group the data by `Store` and calculate:
     - The total sales for each store.
     - The average sales per store.
   - Find the store with the highest total sales.

In [8]:
data = {
    'Store': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Sales': [150, 200, 100, 250, 300, 120, 180],
    'Week': [1, 1, 2, 2, 3, 3, 3]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

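# Total sales per store: group rows by Store, then sum the Sales column
total_sales = df.groupby('Store')['Sales'].sum()
print("\nTotal Sales for Each Store:\n", total_sales)

# Average sale per store (mean over that store's rows)
average_sales = df.groupby('Store')['Sales'].mean()
print("\nAverage Sales per Store:\n", average_sales)

# idxmax returns the index label (the store) of the largest total; max returns the value
highest_sales_store = total_sales.idxmax()
highest_sales_value = total_sales.max()
print(f"\nStore with the Highest Total Sales: {highest_sales_store} (Sales: {highest_sales_value})")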

Original DataFrame:
   Store  Sales  Week
0     A    150     1
1     B    200     1
2     A    100     2
3     C    250     2
4     B    300     3
5     A    120     3
6     C    180     3

Total Sales for Each Store:
 Store
A    370
B    500
C    430
Name: Sales, dtype: int64

Average Sales per Store:
 Store
A    123.333333
B    250.000000
C    215.000000
Name: Sales, dtype: float64

Store with the Highest Total Sales: B (Sales: 500)
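
Both statistics can be computed in a single `groupby` pass with named aggregation. A minimal sketch, assuming the same sales data; the output column names `total` and `average` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Store': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Sales': [150, 200, 100, 250, 300, 120, 180],
    'Week': [1, 1, 2, 2, 3, 3, 3]
})

# Named aggregation: each keyword becomes a column in the result
summary = df.groupby('Store')['Sales'].agg(total='sum', average='mean')

# The store whose total is largest
best_store = summary['total'].idxmax()

print(summary)
print(best_store)
```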


### **Assignment 5: Merging and Joining DataFrames** (15 points)
**Objective**: Practice merging and joining multiple DataFrames.

1. **Task**:
   - Create two DataFrames, one for student information and another for their scores:
     ```python
     students = {
         'StudentID': [1, 2, 3, 4],
         'Name': ['John', 'Jane', 'Dave', 'Anna']
     }
     scores = {
         'StudentID': [1, 2, 4],
         'Math': [80, 95, 92],
         'English': [78, 89, 88]
     }
     ```
   - Perform the following operations:
     - Merge the two DataFrames on `StudentID`.
     - Display all students, including those who do not have scores (use a left join).

In [10]:
students = {
    'StudentID': [1, 2, 3, 4],
    'Name': ['John', 'Jane', 'Dave', 'Anna']
}

scores = {
    'StudentID': [1, 2, 4],
    'Math': [80, 95, 92],
    'English': [78, 89, 88]
}

students_df = pd.DataFrame(students)
scores_df = pd.DataFrame(scores)

print("Student Information DataFrame:\n", students_df)
print("\nScores DataFrame:\n", scores_df)

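# Default merge is an inner join: only StudentIDs present in both frames are kept
merged_df = pd.merge(students_df, scores_df, on='StudentID')
print("\nMerged DataFrame:\n", merged_df)

# Left join keeps every student; missing scores become NaN, so the score columns turn float
left_merged_df = pd.merge(students_df, scores_df, on='StudentID', how='left')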
print("\nMerged DataFrame (Left Join):\n", left_merged_df)


Student Information DataFrame:
    StudentID  Name
0          1  John
1          2  Jane
2          3  Dave
3          4  Anna

Scores DataFrame:
    StudentID  Math  English
0          1    80       78
1          2    95       89
2          4    92       88

Merged DataFrame:
    StudentID  Name  Math  English
0          1  John    80       78
1          2  Jane    95       89
2          4  Anna    92       88

Merged DataFrame (Left Join):
    StudentID  Name  Math  English
0          1  John  80.0     78.0
1          2  Jane  95.0     89.0
2          3  Dave   NaN      NaN
3          4  Anna  92.0     88.0
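
To list the students who have no scores, `pd.merge` accepts `indicator=True`, which adds a `_merge` column recording where each row came from. The sketch below assumes the same two frames; `with_flag` and `no_scores` are illustrative names.

```python
import pandas as pd

students_df = pd.DataFrame({
    'StudentID': [1, 2, 3, 4],
    'Name': ['John', 'Jane', 'Dave', 'Anna']
})
scores_df = pd.DataFrame({
    'StudentID': [1, 2, 4],
    'Math': [80, 95, 92],
    'English': [78, 89, 88]
})

# indicator=True adds a '_merge' column with 'both', 'left_only', or 'right_only'
with_flag = pd.merge(students_df, scores_df, on='StudentID', how='left', indicator=True)

# Rows marked 'left_only' exist in students_df but have no match in scores_df
no_scores = with_flag.loc[with_flag['_merge'] == 'left_only', 'Name']
print(no_scores.tolist())
```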


### **Assignment 6: Applying Functions to DataFrames** (10 points)
**Objective**: Use pandas to apply functions to DataFrame elements.

1. **Task**:
   - Using the DataFrame below, apply a custom function that converts the `Score` column to letter grades (`A` for 90 and above, `B` for 80-89, etc.):
     ```python
     data = {
         'Name': ['John', 'Jane', 'Dave', 'Anna'],
         'Score': [88, 92, 85, 90]
     }
     ```


In [12]:
data = {
    'Name': ['John', 'Jane', 'Dave', 'Anna'],
    'Score': [88, 92, 85, 90]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

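def convert_to_grade(score):
    """Map a numeric score to a letter grade."""
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    else:
        # Everything below 80 is treated as 'C' in this solution
        return 'C'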

df['Grade'] = df['Score'].apply(convert_to_grade)
print("\nDataFrame with Grades:\n", df)

Original DataFrame:
    Name  Score
0  John     88
1  Jane     92
2  Dave     85
3  Anna     90

DataFrame with Grades:
    Name  Score Grade
0  John     88     B
1  Jane     92     A
2  Dave     85     B
3  Anna     90     A
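
For larger frames, a vectorized alternative to `apply` is `pd.cut`, which bins numeric values into labeled intervals. This is a sketch that reproduces the same three-grade scheme as the function above, not a replacement the assignment asks for. With `right=False` each bin includes its lower edge, so 80 maps to `B` and 90 maps to `A`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Dave', 'Anna'],
    'Score': [88, 92, 85, 90]
})

# Bins are [-inf, 80), [80, 90), [90, inf) because right=False closes the left edge
df['Grade'] = pd.cut(
    df['Score'],
    bins=[-np.inf, 80, 90, np.inf],
    labels=['C', 'B', 'A'],
    right=False
)
print(df)
```

The resulting `Grade` column has a categorical dtype rather than plain strings.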


### **Assignment 7: Reshaping and Pivoting Data** (10 points)
**Objective**: Learn to reshape and pivot data in pandas.

1. **Task**:
   - Create a DataFrame containing temperature data for different cities across various dates:
     ```python
     data = {
         'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Chicago'],
         'Date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-02'],
         'Temperature': [30, 75, 25, 32, 77, 27]
     }
     ```
   - Pivot the data so that the rows represent the dates and the columns represent the cities, with the `Temperature` values in the cells.

In [13]:
data = {
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Chicago'],
    'Date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-02'],
    'Temperature': [30, 75, 25, 32, 77, 27]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Dates become the row index, cities become columns, temperatures fill the cells
pivoted_df = df.pivot(index='Date', columns='City', values='Temperature')
print("\nPivoted DataFrame:\n", pivoted_df)

Original DataFrame:
           City        Date  Temperature
0     New York  2023-01-01           30
1  Los Angeles  2023-01-01           75
2      Chicago  2023-01-01           25
3     New York  2023-01-02           32
4  Los Angeles  2023-01-02           77
5      Chicago  2023-01-02           27

Pivoted DataFrame:
 City        Chicago  Los Angeles  New York
Date                                      
2023-01-01       25           75        30
2023-01-02       27           77        32
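
The reverse operation, going from the wide pivoted layout back to one row per city and date, can be done with `melt`. A minimal sketch assuming the same temperature data; `long_df` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'Chicago'],
    'Date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-02'],
    'Temperature': [30, 75, 25, 32, 77, 27]
})

# Long to wide
pivoted_df = df.pivot(index='Date', columns='City', values='Temperature')

# Wide back to long: Date stays as an identifier, each city column becomes rows
long_df = pivoted_df.reset_index().melt(
    id_vars='Date', var_name='City', value_name='Temperature'
)
print(long_df)
```

The rows come back grouped by city rather than in the original order, but the set of (city, date, temperature) records is the same.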


### **Assignment 8: Time Series Data** (10 points)
**Objective**: Work with time series data in pandas.

1. **Task**:
   - Create a date range of 10 consecutive days starting from `2023-01-01`.
   - Create a DataFrame where the index is the date range and the column contains random daily stock prices.
   - Calculate the rolling 3-day average of the stock prices.
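
A rolling mean with `window=3` returns `NaN` for the first two rows, because a full three-value window is not yet available. The sketch below illustrates this on a small made-up series (the prices 10, 20, 30, 40 are hypothetical and chosen only so the averages are easy to check by hand), and shows how `min_periods=1` relaxes the requirement.

```python
import pandas as pd

# Hypothetical prices, used only to make the arithmetic easy to verify
prices = pd.Series(
    [10, 20, 30, 40],
    index=pd.date_range(start='2023-01-01', periods=4)
)

# Default: a value is produced only once 3 observations are in the window
strict = prices.rolling(window=3).mean()

# min_periods=1: average whatever is available, up to 3 observations
lenient = prices.rolling(window=3, min_periods=1).mean()

print(strict)
print(lenient)
```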

In [14]:
import numpy as np
date_range = pd.date_range(start='2023-01-01', periods=10)
print("Date Range:\n", date_range)

np.random.seed(0)  # fix the seed so the random prices are reproducible
stock_prices = np.random.randint(100, 200, size=10)  # 10 integers in [100, 200)
df = pd.DataFrame(stock_prices, index=date_range, columns=['Stock Price'])
print("\nDataFrame with Random Stock Prices:\n", df)

# The first two rows are NaN because a full 3-day window is not yet available
df['3-Day Rolling Average'] = df['Stock Price'].rolling(window=3).mean()
print("\nDataFrame with 3-Day Rolling Average:\n", df)

Date Range:
 DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10'],
              dtype='datetime64[ns]', freq='D')

DataFrame with Random Stock Prices:
             Stock Price
2023-01-01          144
2023-01-02          147
2023-01-03          164
2023-01-04          167
2023-01-05          167
2023-01-06          109
2023-01-07          183
2023-01-08          121
2023-01-09          136
2023-01-10          187

DataFrame with 3-Day Rolling Average:
             Stock Price  3-Day Rolling Average
2023-01-01          144                    NaN
2023-01-02          147                    NaN
2023-01-03          164             151.666667
2023-01-04          167             159.333333
2023-01-05          167             166.000000
2023-01-06          109             147.666667
2023-01-07          183             153.000000
2023-01-08          121            