# **Exploring Pandas: Common Data Operations**

Welcome to this Jupyter Notebook! 🚀 In this notebook, you'll practice some of the most commonly used operations in the **Pandas** library using **two datasets**:
1. **../../data/students.csv** (CSV)
2. **../../data/enrollments.json** (JSON)

These files are expected in the `../../data/` folder relative to this notebook, which is the path the loading cells use. By the end, you'll have a strong grasp of common data manipulation tasks, and you'll even merge these two datasets on a common key.

Before starting, make sure you have **Pandas** installed. (It should come preinstalled in Anaconda!)

If pandas is not installed, follow the instructions below.

---

## **Checking if Pandas is Installed in Your Conda Environment**

Before proceeding, check if Pandas is installed in your Conda environment by running the following command in a **Jupyter Notebook** cell:

In [1]:
import pandas as pd
print(pd.__version__)

2.2.3


If this runs without errors and prints a version number, Pandas is installed. If you see an **ImportError**, install Pandas using one of the following methods:

### **For Conda Users (Recommended)**
Run this in your terminal or Anaconda Prompt:
```
conda install pandas
```

### **Using Conda-Forge (If Needed)**
If you encounter issues, you can install Pandas from **Conda-Forge**, a community-maintained repository with up-to-date packages:
```
conda install -c conda-forge pandas
```

### **For Pip Users**
If you're using a virtual environment outside Conda, install Pandas via Pip:
```
pip install pandas
```

# Now, let’s dive in! 🏊‍♂️

---


## **1. Load a CSV file into a Pandas DataFrame**

First, let's **import Pandas** and load the datasets. Two datasets have been prepared for you:

- `students.csv`
- `enrollments.json`

You will use these two datasets for the following challenges.

**💡 Hint:** If the file is in the same directory as your notebook, you can just use the filename. Otherwise, provide a relative or absolute path to the file, such as `../../data/students.csv`.
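
As a minimal sketch of the two loaders, the example below reads CSV and JSON text from in-memory buffers so it runs without the course files. The sample records and the names `students_demo` and `enrollments_demo` are made up for illustration and are not taken from the real datasets.

```python
import io
import pandas as pd

# Hypothetical sample text standing in for students.csv and enrollments.json
csv_text = "student_id,First_Name\nSTU001,John\nSTU002,Maria\n"
json_text = '[{"stud_ref_id": "STU001", "subject_code": "MATH101"}]'

# read_csv and read_json accept a file path or a file-like object
students_demo = pd.read_csv(io.StringIO(csv_text))
enrollments_demo = pd.read_json(io.StringIO(json_text))

print(students_demo)
print(enrollments_demo)
```

With real files, the path string simply takes the place of the `io.StringIO(...)` argument.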


In [2]:
import pandas as pd
df = pd.read_csv('../../data/students.csv')
print(df)

    student_id First_Name last_name  Birthddate gender        majorField  \
0       STU001       John       Doe   4/12/1998      M  Computer Science   
1       STU002      Maria  Gonzalez    9/5/1997      F           Biology   
2       STU003      Priya     Patel   1/23/1999      F       Engineering   
3       STU004       Alex   Johnson  12/15/1996      M       Mathematics   
4       STU005      Emily     Smith   7/30/2000      F           Physics   
..         ...        ...       ...         ...    ...               ...   
98      STU096   Victoria     Ortiz    8/2/1996      F       Mathematics   
99      STU097     Julian    Foster    9/3/1998      M           Physics   
100     STU098       Lucy   Ramirez   10/4/1997      F         Chemistry   
101     STU099     Isaiah       Kim   11/5/1999      M         Economics   
102     STU100     Amelia     Lopez   12/6/1996      F           History   

    admission_year  current gpa               contact_email  mobile number  \
0        

In [3]:
import pandas as pd
de = pd.read_json('../../data/enrollments.json')
print(de)
print(de.describe())

               enrollment_id stud_ref_id subject_code           course_title  \
0    ENR-STU001-MATH101-AB12      STU001      MATH101             Calculus I   
1     ENR-STU002-ENG201-CD34      STU002       ENG201     English Literature   
2      ENR-STU003-CS301-EF56      STU003        CS301        Programming 101   
3    ENR-STU004-HIST105-GH78      STU004      HIST105          World History   
4    ENR-STU005-CHEM110-IJ90      STU005      CHEM110      Organic Chemistry   
..                       ...         ...          ...                    ...   
96   ENR-STU097-MATH101-MN34      STU097      MATH101             Calculus I   
97    ENR-STU098-ENG201-OP56      STU098       ENG201     English Literature   
98     ENR-STU099-CS301-QR78      STU099        CS301        Programming 101   
99   ENR-STU100-HIST105-ST90      STU100      HIST105          World History   
100  ENR-STU004-HIST999-DUP1      STU004      HIST999  Ancient Civilizations   

    instructor_name  enroll_count term_

## **2. View the First and Last Few Rows of Each DataFrame**

Check out how your data looks. One method previews the first few records, while another method previews the last few. You can specify the number of rows you want to see by explicitly passing an integer argument.

**📝 Tip:** This is a great time to confirm that columns loaded correctly and to spot any obvious data issues (strange values, mismatched columns, etc.).
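
The two preview methods described above can be sketched on a tiny frame. The `demo` DataFrame here is hypothetical and exists only to make the row counts easy to check.

```python
import pandas as pd

# Hypothetical ten-row frame used only to show the row-preview methods
demo = pd.DataFrame({"n": range(10)})

first_three = demo.head(3)  # first 3 rows (5 rows when no argument is given)
last_two = demo.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```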

In [76]:
print(df.head(1))

  student_id First_Name last_name Birthddate gender        majorField  \
0     STU001       John       Doe  4/12/1998      M  Computer Science   

  admission_year  current gpa         contact_email  mobile number home_city  \
0           2020          3.5  john.doe@example.com         -655.0     Tampa   

  HOME COUNTRY  
0          USA  


In [77]:
print(df.iloc[-1])


student_id                          STU100
First_Name                          Amelia
last_name                            Lopez
Birthddate                       12/6/1996
gender                                   F
majorField                         History
admission_year                        2020
current gpa                            3.3
contact_email     amelia.lopez@example.com
mobile number                       -754.0
home_city                          Raleigh
HOME COUNTRY                           USA
Name: 102, dtype: object


In [78]:
print(de.tail(1))

               enrollment_id stud_ref_id subject_code           course_title  \
100  ENR-STU004-HIST999-DUP1      STU004      HIST999  Ancient Civilizations   

    instructor_name  enroll_count term_offered course_fee final_result  \
100     Prof. Brown            99    Fall 2023        NaN                

     attend_percentage date_enrolled  
100                 67    2023-09-01  


In [79]:
print(de.iloc[-1])

enrollment_id        ENR-STU004-HIST999-DUP1
stud_ref_id                           STU004
subject_code                         HIST999
course_title           Ancient Civilizations
instructor_name                  Prof. Brown
enroll_count                              99
term_offered                       Fall 2023
course_fee                               NaN
final_result                                
attend_percentage                         67
date_enrolled                     2023-09-01
Name: 100, dtype: object


## **3. Check the Shape of Each DataFrame**

To understand the **size** of your dataset(s), use the attribute that returns `(number_of_rows, number_of_columns)`.

**📝 Tip:** Note any big differences in row counts that might affect merging later.


In [87]:
print(df.shape)
num_rows = len(df)
print(num_rows)


(103, 12)
103


In [89]:
print(de.shape)
print(de.head(1))
print(len(de.head(1)))



(101, 11)
             enrollment_id stud_ref_id subject_code course_title  \
0  ENR-STU001-MATH101-AB12      STU001      MATH101   Calculus I   

  instructor_name  enroll_count term_offered course_fee final_result  \
0       Dr. Smith            25  Spring 2026       1000            A   

   attend_percentage date_enrolled  
0                 71    2026/01/15  
1


## **4. Get a Summary of Each DataFrame**

Explore one or two approaches that provide:

-   **Column names**
-   **Data types**
-   **Basic statistics about numerical columns**
-   **Number of non-null values**

**📝 Tip:** One approach might give an overview of columns and data types; another might summarize numerical columns. This step helps you detect columns that might need cleaning.
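
A short sketch of the two approaches, assuming a small made-up frame (`demo`) rather than the course data: `info()` covers column names, dtypes, and non-null counts, while `describe()` covers statistics for numeric columns.

```python
import pandas as pd

# Hypothetical data: one text column with a missing value, one numeric column
demo = pd.DataFrame({
    "name": ["Ann", "Ben", None],
    "gpa": [3.5, 3.9, 3.1],
})

# info() is a method, so it needs parentheses; it prints its report directly
demo.info()

# describe() returns a DataFrame of summary statistics for numeric columns
stats = demo.describe()
print(stats)
```

Writing `demo.info` without parentheses only shows the method object, not the summary.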

In [90]:
print(de.head(0))

Empty DataFrame
Columns: [enrollment_id, stud_ref_id, subject_code, course_title, instructor_name, enroll_count, term_offered, course_fee, final_result, attend_percentage, date_enrolled]
Index: []


In [93]:
print(de.dtypes)
df.info()  # info is a method: call it with parentheses; it prints its own summary

enrollment_id        object
stud_ref_id          object
subject_code         object
course_title         object
instructor_name      object
enroll_count          int64
term_offered         object
course_fee           object
final_result         object
attend_percentage     int64
date_enrolled        object
dtype: object


## **5. Check for Missing Values**

Determine if your dataset has any missing or null values by **counting** them. Notice which columns have many missing entries and plan how to handle them.

**📝 Tip:** Some columns might look present but contain empty strings. Identify them if possible.
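
The distinction between true nulls and empty strings can be sketched as follows. The frame is hypothetical; the column names echo the enrollments data, but the values are invented.

```python
import pandas as pd

# Hypothetical data: blanks in a text column, one real null in a numeric column
demo = pd.DataFrame({
    "final_result": ["A", "", "B", "  "],
    "course_fee": [1000.0, None, 1200.0, 900.0],
})

# Count true missing values (NaN/None) per column
null_counts = demo.isna().sum()

# Count strings that are empty or whitespace-only, which isna() does not flag
blank_counts = demo["final_result"].str.strip().eq("").sum()

print(null_counts)
print(blank_counts)
```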


In [18]:
print(df.notnull())

     student_id  First_Name  last_name  Birthddate  gender  majorField  \
0          True        True       True        True    True        True   
1          True        True       True        True    True        True   
2          True        True       True        True    True        True   
3          True        True       True        True    True        True   
4          True        True       True        True    True        True   
..          ...         ...        ...         ...     ...         ...   
98         True        True       True        True    True        True   
99         True        True       True        True    True        True   
100        True        True       True        True    True        True   
101        True        True       True        True    True        True   
102        True        True       True        True    True        True   

     admission_year  current gpa  contact_email  mobile number  home_city  \
0              True         True  

In [96]:
print(de.isnull().sum())
print(df.isnull().sum())

enrollment_id        0
stud_ref_id          0
subject_code         0
course_title         0
instructor_name      0
enroll_count         0
term_offered         0
course_fee           1
final_result         0
attend_percentage    0
date_enrolled        0
dtype: int64
student_id        0
First_Name        0
last_name         0
Birthddate        0
gender            0
majorField        1
admission_year    0
current gpa       8
contact_email     0
mobile number     6
home_city         0
HOME COUNTRY      0
dtype: int64


In [2]:
import pandas as pd
df = pd.read_csv('../../data/students.csv')
print(df.isna())
print(df.isna().sum())

     student_id  First_Name  last_name  Birthddate  gender  majorField  \
0         False       False      False       False   False       False   
1         False       False      False       False   False       False   
2         False       False      False       False   False       False   
3         False       False      False       False   False       False   
4         False       False      False       False   False       False   
..          ...         ...        ...         ...     ...         ...   
98        False       False      False       False   False       False   
99        False       False      False       False   False       False   
100       False       False      False       False   False       False   
101       False       False      False       False   False       False   
102       False       False      False       False   False       False   

     admission_year  current gpa  contact_email  mobile number  home_city  \
0             False        False  

## **6. Rename Columns for Clarity and Consistency**

Some columns may have **spaces** or **capitalization** that complicates your analysis. For example, if you see `"current gpa"` or `"First_Name"`, consider renaming them (e.g., `"current_gpa"`, `"first_name"`) for ease of use.

**📝 Tip:** Consistent naming conventions help minimize typos and KeyErrors.
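
Two common ways to rename are sketched below: a mapping for selected columns, and a bulk normalization of every name. The column names are copied from the students output; the frame itself is an empty stand-in, and the lowercase-with-underscores convention is just one possible choice.

```python
import pandas as pd

# Empty stand-in frame; only the column names matter for this example
demo = pd.DataFrame(columns=["First_Name", "current gpa", "HOME COUNTRY", "majorField"])

# Option 1: rename selected columns with an old-name -> new-name mapping
renamed = demo.rename(columns={"First_Name": "first_name", "current gpa": "current_gpa"})

# Option 2: normalize every name at once (trim, lowercase, spaces to underscores)
normalized = demo.copy()
normalized.columns = normalized.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(renamed.columns))
print(list(normalized.columns))
```

Assigning a list to `df.columns` directly also works, but the list must contain exactly one name per existing column.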

In [4]:
import pandas as pd
df = pd.read_csv('../../data/students.csv')
# Rename one column with a mapping; assigning to df.columns needs a name for every column
df.rename(columns={'current gpa': 'current_gpa'}, inplace=True)

In [6]:
# The mapping renames only the listed column and leaves the other eleven untouched
df.rename(columns={'First_Name': 'first_name'}, inplace=True)

In [3]:
print(df)

    student_id First_Name last_name  Birthddate gender        majorField  \
0       STU001       John       Doe   4/12/1998      M  Computer Science   
1       STU002      Maria  Gonzalez    9/5/1997      F           Biology   
2       STU003      Priya     Patel   1/23/1999      F       Engineering   
3       STU004       Alex   Johnson  12/15/1996      M       Mathematics   
4       STU005      Emily     Smith   7/30/2000      F           Physics   
..         ...        ...       ...         ...    ...               ...   
98      STU096   Victoria     Ortiz    8/2/1996      F       Mathematics   
99      STU097     Julian    Foster    9/3/1998      M           Physics   
100     STU098       Lucy   Ramirez   10/4/1997      F         Chemistry   
101     STU099     Isaiah       Kim   11/5/1999      M         Economics   
102     STU100     Amelia     Lopez   12/6/1996      F           History   

    admission_year  current gpa               contact_email  mobile number  \
0        

In [20]:
# your code here

## **7. Convert Data Types Where Needed**

Check which columns should be numeric or datetime. Columns like `admission_year` or `course_fee` might be read as **strings** by default. Convert them to numerical or date formats if necessary.

**📝 Tip:** Make sure you handle errors gracefully (e.g., set `errors='coerce'` to turn invalid entries into NaN).
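
A minimal sketch of coercing conversions, assuming made-up values that include deliberately invalid entries. It uses a single date format for simplicity; real data with mixed formats would need extra handling.

```python
import pandas as pd

# Hypothetical data with one invalid entry per column
demo = pd.DataFrame({
    "course_fee": ["1000", "n/a", 1200],
    "date_enrolled": ["2026-01-15", "not a date", "2023-09-01"],
})

# Invalid numbers become NaN instead of raising an error
demo["course_fee"] = pd.to_numeric(demo["course_fee"], errors="coerce")

# Invalid dates become NaT; format tells pandas how to parse the valid ones
demo["date_enrolled"] = pd.to_datetime(
    demo["date_enrolled"], format="%Y-%m-%d", errors="coerce"
)

print(demo.dtypes)
print(demo)
```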

In [7]:
import pandas as pd
# course_fee is a column of the enrollments data, so load that file rather than students.csv
de = pd.read_json('../../data/enrollments.json')
de['course_fee'] = pd.to_numeric(de['course_fee'], errors='coerce')

In [22]:
# admission_year holds a plain year, so convert it to a number rather than a timestamp
df['admission_year'] = pd.to_numeric(df['admission_year'], errors='coerce')

print(df.dtypes)

In [23]:
# your code here

In [24]:
import pandas as pd
df = pd.read_csv('../../data/students.csv')

## **8. Fill Missing Values with a Specified Value or Method**

Instead of **dropping** missing values, consider **replacing** them. For instance:

-   A string like `"Unknown"` for missing text
-   A **mean** or **median** for missing numeric columns
-   A **forward** or **backward fill** if appropriate
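
The three strategies listed above can be sketched like this. The data is hypothetical, and choosing the median over the mean is only one reasonable option.

```python
import pandas as pd

# Hypothetical data with one missing value per column
demo = pd.DataFrame({
    "majorField": ["Biology", None, "Physics"],
    "current_gpa": [3.0, None, 4.0],
})

# Text column: replace missing entries with a placeholder label
demo["majorField"] = demo["majorField"].fillna("Unknown")

# Numeric column: replace missing entries with the column median
demo["current_gpa"] = demo["current_gpa"].fillna(demo["current_gpa"].median())

# Forward fill carries the previous valid value down the column
filled_forward = pd.Series([1.0, None, None, 4.0]).ffill()

print(demo)
print(filled_forward.tolist())
```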

In [25]:
# your code here

In [26]:
# your code here

In [27]:
# your code here

## **9. Drop Rows or Columns with Missing Values (If Needed)**

After considering which values can be filled, you might choose to **remove** rows or columns that are missing too much data or can’t be fixed.

**📝 Tip:** Decide carefully and confirm you don’t need the dropped information. Assign the result to a new DataFrame if you want to preserve the original data; `inplace=True` modifies the original DataFrame directly.
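
As a sketch of more targeted dropping, the example below uses the `subset` and `thresh` parameters of `dropna` on a hypothetical frame. The threshold of 2 is arbitrary and chosen only so the effect is visible.

```python
import pandas as pd

# Hypothetical data: one sparse column and one mostly complete column
demo = pd.DataFrame({
    "student_id": ["STU001", "STU002", "STU003"],
    "current_gpa": [3.5, None, 3.7],
    "mobile_number": [None, None, 555.0],
})

# Drop rows only when a required column is missing
rows_kept = demo.dropna(subset=["current_gpa"])

# Drop columns that have fewer than 2 non-null values
cols_kept = demo.dropna(axis=1, thresh=2)

print(rows_kept)
print(cols_kept)
```

Both calls return new DataFrames, so `demo` still holds all three rows and columns afterward.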

In [12]:
import pandas as pd
import numpy as np
df = pd.read_csv('../../data/students.csv')
df_cleaned = df.dropna(axis=1)

print(df_cleaned)

    student_id First_Name last_name  Birthddate gender admission_year  \
0       STU001       John       Doe   4/12/1998      M           2020   
1       STU002      Maria  Gonzalez    9/5/1997      F           2019   
2       STU003      Priya     Patel   1/23/1999      F           2021   
3       STU004       Alex   Johnson  12/15/1996      M           2018   
4       STU005      Emily     Smith   7/30/2000      F           2022   
..         ...        ...       ...         ...    ...            ...   
98      STU096   Victoria     Ortiz    8/2/1996      F           2020   
99      STU097     Julian    Foster    9/3/1998      M           2022   
100     STU098       Lucy   Ramirez   10/4/1997      F           2019   
101     STU099     Isaiah       Kim   11/5/1999      M           2021   
102     STU100     Amelia     Lopez   12/6/1996      F           2020   

                  contact_email  home_city HOME COUNTRY       Full_name  
0          john.doe@example.com      Tampa       

In [10]:
import pandas as pd
import numpy as np

de = pd.read_json('../../data/enrollments.json')
de_cleaned = de.dropna(axis=1)

print(de_cleaned)

               enrollment_id stud_ref_id subject_code           course_title  \
0    ENR-STU001-MATH101-AB12      STU001      MATH101             Calculus I   
1     ENR-STU002-ENG201-CD34      STU002       ENG201     English Literature   
2      ENR-STU003-CS301-EF56      STU003        CS301        Programming 101   
3    ENR-STU004-HIST105-GH78      STU004      HIST105          World History   
4    ENR-STU005-CHEM110-IJ90      STU005      CHEM110      Organic Chemistry   
..                       ...         ...          ...                    ...   
96   ENR-STU097-MATH101-MN34      STU097      MATH101             Calculus I   
97    ENR-STU098-ENG201-OP56      STU098       ENG201     English Literature   
98     ENR-STU099-CS301-QR78      STU099        CS301        Programming 101   
99   ENR-STU100-HIST105-ST90      STU100      HIST105          World History   
100  ENR-STU004-HIST999-DUP1      STU004      HIST999  Ancient Civilizations   

    instructor_name  enroll_count term_

In [30]:
# your code here

In [31]:
# your code here

## **10. Filter Rows Based on a Condition**

Now that columns like `admission_year` and `course_fee` (or `current_gpa`) are numeric, experiment with filtering. For example:

-   Students whose `admission_year` is after a certain year
-   Enrollments for `Spring 2026`
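
Filtering works by building a boolean mask and indexing the DataFrame with it. The sketch below assumes a small made-up frame with numeric `admission_year` and `current_gpa` columns.

```python
import pandas as pd

# Hypothetical student records
students_demo = pd.DataFrame({
    "student_id": ["STU001", "STU002", "STU003"],
    "admission_year": [2020, 2022, 2023],
    "current_gpa": [3.5, 2.9, 3.8],
})

# A comparison produces a boolean mask; indexing with it keeps the matching rows
recent = students_demo[students_demo["admission_year"] > 2021]

# Combine conditions with & (and) or | (or), wrapping each one in parentheses
recent_high_gpa = students_demo[
    (students_demo["admission_year"] > 2021) & (students_demo["current_gpa"] >= 3.0)
]

print(recent)
print(recent_high_gpa)
```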

In [14]:
import pandas as pd
df = pd.read_csv('../../data/students.csv')
# admission_year holds a plain year, not a datetime, so compare it as a number
admission_year = pd.to_numeric(df['admission_year'], errors='coerce')
df_filtered = df[admission_year > 2021]

In [16]:
# Assumes the column has already been renamed to current_gpa (step 6)
df_filtered = df[df['current_gpa'] >= 2.0]  # index with the mask to keep matching rows
print(df_filtered.shape)


In [34]:
# your code here

In [35]:
# your code here

## **11. Select Specific Columns from Each DataFrame**

Often, you don’t need all columns at once. For instance, you might extract only:

-   `"student_id"`, `"First_Name"`, `"last_name"`, and `"current_gpa"` from `students.csv`
-   `"stud_ref_id"`, `"course_title"`, `"instructor_name"`, `"course_fee"` from `enrollments.json`

In [29]:
import pandas as pd 
df = pd.read_csv('../../data/students.csv')
print(df[['student_id', 'First_Name', 'last_name']])
print(df[['student_id', 'First_Name', 'last_name', 'current gpa']])

    student_id First_Name last_name
0       STU001       John       Doe
1       STU002      Maria  Gonzalez
2       STU003      Priya     Patel
3       STU004       Alex   Johnson
4       STU005      Emily     Smith
..         ...        ...       ...
98      STU096   Victoria     Ortiz
99      STU097     Julian    Foster
100     STU098       Lucy   Ramirez
101     STU099     Isaiah       Kim
102     STU100     Amelia     Lopez

[103 rows x 3 columns]
    student_id First_Name last_name  current gpa
0       STU001       John       Doe          3.5
1       STU002      Maria  Gonzalez          3.8
2       STU003      Priya     Patel          3.7
3       STU004       Alex   Johnson          3.2
4       STU005      Emily     Smith          3.9
..         ...        ...       ...          ...
98      STU096   Victoria     Ortiz          3.7
99      STU097     Julian    Foster          3.8
100     STU098       Lucy   Ramirez          3.9
101     STU099     Isaiah       Kim          3.2
102  

In [37]:
# your code here

In [38]:
# your code here

In [39]:
# your code here

## **12. Sort the DataFrame by One or More Columns**

Sorting can help you identify which records have the highest or lowest values. For example:

-   Sort the **students** DataFrame by `"current_gpa"` in descending order
-   Sort the **enrollments** DataFrame by `"course_fee"` in ascending order

**📝 Tip:** You can sort by multiple columns if needed.
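
Sorting by several columns can be sketched as follows, with a separate direction per column. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical data with two majors
demo = pd.DataFrame({
    "majorField": ["Physics", "Biology", "Physics", "Biology"],
    "current_gpa": [3.8, 3.1, 3.9, 3.6],
})

# Sort by major A-Z, then by GPA from highest to lowest within each major
sorted_demo = demo.sort_values(
    by=["majorField", "current_gpa"], ascending=[True, False]
)

print(sorted_demo)
```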


In [17]:
df_sorted = df.sort_values(by='current_gpa', ascending=False)
print(df_sorted)

   student_id first_name  last_name  Birthddate gender   majorField  \
68     STU066   Victoria     Morris    2/2/1997      F    Chemistry   
52     STU050     Amelia      Scott    9/9/1997      F    Chemistry   
24     STU025     Elijah  Hernandez  12/30/1998      M      Physics   
92     STU090      Molly     Barnes   2/26/1997      F    Chemistry   
31     STU032      Grace      Moore   2/28/1996      F  Mathematics   
..        ...        ...        ...         ...    ...          ...   
70     STU068       Zoey       Reed    4/4/1996      F      History   
74     STU072      Layla     Murphy    8/8/1996      F  Mathematics   
81     STU079       Owen       Ward   3/15/1999      M  Engineering   
89     STU087      Wyatt      Price  11/23/1999      M  Engineering   
94     STU092     Claire     Powell   4/28/1996      F      History   

   admission_year  current_gpa                 contact_email  mobile number  \
68           2019          3.9   victoria.morris@example.com        

In [19]:
# course_fee mixes numbers and strings, so convert the column before sorting on it
de['course_fee'] = pd.to_numeric(de['course_fee'], errors='coerce')
de_sorted = de.sort_values(by='course_fee')
print(de_sorted)

In [42]:
# your code here

In [43]:
# your code here

## **13. Group Data by a Column and Compute Aggregate Functions**

Grouping lets you see aggregated info by category. For example, group **students** by `"majorField"` and compute the average `"current_gpa"`. In **enrollments**, group by `"instructor_name"` and compute the average `"course_fee"`.

**📝 Tip:** Aggregations might include `.mean()`, `.sum()`, `.count()`, etc.
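
Several aggregations can be computed in one pass with named aggregation. This is a sketch on made-up enrollment rows; the output column names `avg_fee` and `n_enrollments` are arbitrary labels chosen for this example.

```python
import pandas as pd

# Hypothetical enrollment rows
demo = pd.DataFrame({
    "instructor_name": ["Dr. Smith", "Dr. Smith", "Prof. Brown"],
    "course_fee": [1000, 1200, 900],
})

# Named aggregation: each keyword is output_name=(source_column, function)
summary = (
    demo.groupby("instructor_name")
    .agg(avg_fee=("course_fee", "mean"), n_enrollments=("course_fee", "count"))
    .reset_index()
)

print(summary)
```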

In [20]:
df_grouped = df.groupby('majorField')['current_gpa'].mean().reset_index()

print(df_grouped)

         majorField  current_gpa
0           Biology     3.515385
1         Chemistry     3.850000
2  Computer Science     3.450000
3         Economics     3.290909
4       Engineering     3.563636
5           History     3.344444
6       Mathematics     3.638462
7           Physics     3.753846


In [21]:
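# instructor_name and course_fee belong to the enrollments DataFrame (de), not df
de['course_fee'] = pd.to_numeric(de['course_fee'], errors='coerce')
de_grouped = de.groupby('instructor_name')['course_fee'].mean().reset_index()

print(de_grouped)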

In [46]:
# your code here

In [47]:
# your code here

## **14. Apply a Custom Function**

Define a normal Python function to transform data in a column. For example, title-case a name or uppercase a field. Apply that function to each element in the column.

**📝 Tip:** If your function references another library call or has complex logic, define it above and then use `.apply(...)` with your function name. Once you've done this, see if you can do the same thing using lambda notation.
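
The named-function and lambda forms can be compared side by side. The helper `clean_title` and the sample titles are hypothetical.

```python
import pandas as pd

# Hypothetical titles with inconsistent casing and one missing value
demo = pd.DataFrame({"course_title": ["calculus i", "WORLD HISTORY", None]})

def clean_title(value):
    """Title-case a string; pass non-string values (such as None) through."""
    if isinstance(value, str):
        return value.strip().title()
    return value

# apply() calls the function once per element of the column
demo["title_func"] = demo["course_title"].apply(clean_title)

# The same transformation written as a lambda
demo["title_lambda"] = demo["course_title"].apply(
    lambda v: v.strip().title() if isinstance(v, str) else v
)

print(demo)
```

The `isinstance` check matters because missing values are not strings and have no `.title()` method.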

In [23]:
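import pandas as pd

de = pd.read_json('../../data/enrollments.json')

def to_title_case(value):
    # Title-case strings and leave other values (such as NaN) unchanged
    return value.title() if isinstance(value, str) else value

# DataFrames have no .title(); apply the function to one text column instead
de_title = de['course_title'].apply(to_title_case)

print(de_title)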

In [49]:
# your code here

In [50]:
# your code here

In [51]:
# your code here

## **15. Create a New Column Based on Existing Ones**

Use existing columns to generate new ones. For instance, combine `"First_Name"` and `"last_name"` into `"full_name"`, or compute `"fees_after_tax"` in enrollments if you assume a tax rate.

In [6]:
import pandas as pd 
df = pd.read_csv('../../data/students.csv')
df['Full_name'] = df['First_Name'] + ' ' + df['last_name']
print(df)

    student_id First_Name last_name  Birthddate gender        majorField  \
0       STU001       John       Doe   4/12/1998      M  Computer Science   
1       STU002      Maria  Gonzalez    9/5/1997      F           Biology   
2       STU003      Priya     Patel   1/23/1999      F       Engineering   
3       STU004       Alex   Johnson  12/15/1996      M       Mathematics   
4       STU005      Emily     Smith   7/30/2000      F           Physics   
..         ...        ...       ...         ...    ...               ...   
98      STU096   Victoria     Ortiz    8/2/1996      F       Mathematics   
99      STU097     Julian    Foster    9/3/1998      M           Physics   
100     STU098       Lucy   Ramirez   10/4/1997      F         Chemistry   
101     STU099     Isaiah       Kim   11/5/1999      M         Economics   
102     STU100     Amelia     Lopez   12/6/1996      F           History   

    admission_year  current gpa               contact_email  mobile number  \
0        

In [53]:
# your code here

In [39]:
import pandas as pd
de = pd.read_json('../../data/enrollments.json')
# Convert fees to numbers first; invalid or missing fees become NaN
course_fee = pd.to_numeric(de['course_fee'], errors='coerce')
# Assumed 10% tax rate; rows without a valid fee get 0
de['taxed_fee'] = (course_fee * 1.1).fillna(0)
print(de)

In [55]:
# your code here

## **16. Merge Two DataFrames on a Common Column**

Combine `students.csv` and `enrollments.json` by matching:

-   `stu["student_id"]`
-   `enr["stud_ref_id"]` (or rename it first)

Check the shape of the merged DataFrame afterward to ensure it merged as expected.
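
One way to check that a merge behaved as expected is the `indicator` option, sketched here on hypothetical frames. `STU999` is an invented ID with no matching student, included to show an unmatched row.

```python
import pandas as pd

# Hypothetical students and enrollments with differently named key columns
students_demo = pd.DataFrame({
    "student_id": ["STU001", "STU002", "STU003"],
    "last_name": ["Doe", "Gonzalez", "Patel"],
})
enrollments_demo = pd.DataFrame({
    "stud_ref_id": ["STU001", "STU001", "STU002", "STU999"],
    "subject_code": ["MATH101", "CS301", "ENG201", "HIST105"],
})

# A left merge keeps every enrollment; indicator=True adds a _merge column
# that records whether each row found a matching student
merged_demo = enrollments_demo.merge(
    students_demo,
    left_on="stud_ref_id",
    right_on="student_id",
    how="left",
    indicator=True,
)

print(merged_demo.shape)
print(merged_demo["_merge"].value_counts())
```

An inner merge would silently drop the unmatched enrollment, and repeated keys on either side multiply rows, so comparing shapes before and after is a useful check.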


In [1]:
import pandas as pd
de = pd.read_json('../../data/enrollments.json')
df = pd.read_csv('../../data/students.csv')
merged_df = pd.merge(de, df, left_on='stud_ref_id', right_on='student_id', how='inner')  # 'inner' keeps only matching IDs
print(merged_df)
print(merged_df.describe())


               enrollment_id stud_ref_id subject_code           course_title  \
0    ENR-STU001-MATH101-AB12      STU001      MATH101             Calculus I   
1     ENR-STU002-ENG201-CD34      STU002       ENG201     English Literature   
2      ENR-STU003-CS301-EF56      STU003        CS301        Programming 101   
3    ENR-STU004-HIST105-GH78      STU004      HIST105          World History   
4    ENR-STU005-CHEM110-IJ90      STU005      CHEM110      Organic Chemistry   
..                       ...         ...          ...                    ...   
99   ENR-STU097-MATH101-MN34      STU097      MATH101             Calculus I   
100   ENR-STU098-ENG201-OP56      STU098       ENG201     English Literature   
101    ENR-STU099-CS301-QR78      STU099        CS301        Programming 101   
102  ENR-STU100-HIST105-ST90      STU100      HIST105          World History   
103  ENR-STU004-HIST999-DUP1      STU004      HIST999  Ancient Civilizations   

    instructor_name  enroll_count term_

In [57]:
# your code here

In [58]:
# your code here

In [59]:
# your code here

## **17. Remove Duplicate Rows**

When merging or concatenating multiple files, duplicates can crop up. Identify them and remove if needed. This might be especially important if the same student or enrollment is listed more than once.
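
Identifying and removing duplicates can be sketched as follows. The rows are invented, and keeping the first occurrence per `student_id` is an assumption about which copy should survive.

```python
import pandas as pd

# Hypothetical data in which STU002 appears twice
demo = pd.DataFrame({
    "student_id": ["STU001", "STU002", "STU002", "STU003"],
    "contact_email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
})

# duplicated() flags every repeat after the first occurrence
n_duplicates = demo.duplicated().sum()

# Compare only the key column and keep the first row for each student
deduped = demo.drop_duplicates(subset=["student_id"], keep="first").reset_index(drop=True)

print(n_duplicates)
print(deduped)
```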

In [60]:
# your code here

In [61]:
df_unique = merged_df.drop_duplicates()

print(df_unique)

In [62]:
# your code here

In [63]:
# your code here

## **18. Additional Data Cleaning**

Now that you’ve merged or manipulated your data, do a quick final pass:

-   Fix any remaining oddities (e.g., negative phone numbers or impossible dates)
-   Normalize columns further (e.g., standardize text formatting)

**📝 Tip:** You might revisit previous steps if new issues appear.
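
A sketch of two typical final-pass fixes, on made-up values. Treating negative phone numbers as invalid is an assumption here; whether that is right depends on how the real data was produced.

```python
import pandas as pd

# Hypothetical data with negative phone numbers and inconsistent city text
demo = pd.DataFrame({
    "mobile_number": [-655.0, 5551234.0, -754.0],
    "home_city": [" tampa", "RALEIGH ", "Austin"],
})

# mask() replaces values where the condition is True with NaN
demo["mobile_number"] = demo["mobile_number"].mask(demo["mobile_number"] < 0)

# Standardize text: trim whitespace and use title case
demo["home_city"] = demo["home_city"].str.strip().str.title()

print(demo)
```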


In [64]:
import pandas as pd
de = pd.read_json('../../data/enrollments.json')
df = pd.read_csv('../../data/students.csv')


In [65]:
# your code here

In [66]:
# your code here

In [67]:
# your code here

## **19. Save the Cleaned and Merged DataFrame to a New CSV File**

Finally, when you’re satisfied with your cleaned data, save it. Remember to avoid writing the index as a separate column unless you want it.
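
The effect of `index=False` can be seen without touching the disk by writing to an in-memory buffer. This is a sketch with a hypothetical two-row frame.

```python
import io
import pandas as pd

# Hypothetical cleaned data
demo = pd.DataFrame({"student_id": ["STU001", "STU002"], "current_gpa": [3.5, 3.8]})

# index=False leaves the row index out of the output
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back confirms there is no extra index column
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```

To write an actual file, pass a path in place of the buffer; a name such as `cleaned_merged.csv` would be a hypothetical choice.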

In [68]:
import pandas as pd
de = pd.read_json('../../data/enrollments.json')
df = pd.read_csv('../../data/students.csv')

In [69]:
# your code here

In [70]:
# your code here

In [71]:
# your code here

## **20. Explore Further Analyses (Optional)**

Now that your data is in great shape, try some optional challenges:

-   Generate charts or visualizations
-   Perform advanced filtering or grouping
-   Create pivot tables
-   Or anything else that interests you!
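
As one possible starting point for the pivot-table idea, the sketch below computes the mean GPA per major and gender on invented data.

```python
import pandas as pd

# Hypothetical student records
demo = pd.DataFrame({
    "majorField": ["Physics", "Physics", "Biology", "Biology"],
    "gender": ["F", "M", "F", "F"],
    "current_gpa": [3.9, 3.7, 3.4, 3.6],
})

# Rows are majors, columns are genders, cells hold the mean GPA
pivot = demo.pivot_table(
    index="majorField", columns="gender", values="current_gpa", aggfunc="mean"
)

print(pivot)
```

Combinations with no data, such as Biology and `M` here, show up as NaN in the result.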

In [72]:
import pandas as pd
de = pd.read_json('../../data/enrollments.json')
df = pd.read_csv('../../data/students.csv')

In [73]:
# your code here

In [74]:
# your code here

In [75]:
# your code here

**🎉 Congratulations!** You’ve now tackled **data cleaning** and many essential **Pandas** operations in `students.csv` and `enrollments.json`. Keep experimenting to sharpen your **data manipulation skills** and unlock deeper insights! 💪