# Introduction to Pandas Lab

Complete the following set of exercises to solidify your knowledge of Pandas fundamentals.

### 1. Import Numpy and Pandas and alias them to `np` and `pd` respectively.

In [1]:
import pandas as pd
import numpy as np

### 2. Create a Pandas Series containing the elements of the list below.

In [3]:
# Given list of numbers

lst = [5.7, 75.2, 74.4, 84.0, 66.5, 66.3, 55.8, 75.7, 29.1, 43.7]

In [63]:
# your code here

# Creating a Pandas Series from the list
# A Series is like a 1-dimensional array with labels, where each element in the list
# will be assigned an index (0 to 9, by default).
series = pd.Series(lst)
print(series)

0     5.7
1    75.2
2    74.4
3    84.0
4    66.5
5    66.3
6    55.8
7    75.7
8    29.1
9    43.7
dtype: float64


| **Aspect** | **Python List** | **Pandas Series** |
|------------|-----------------|-------------------|
| **Definition** | A built-in Python data structure that holds a collection of items. | A one-dimensional labeled array in the Pandas library. |
| **Data Structure** | Basic data structure in Python. | Part of the Pandas library, specialized for data analysis. |
| **Indexing** | Implicit positional indexing (0, 1, 2, ...); custom indices are not supported. | Explicitly indexed (default is 0, 1, 2, ...), and you can define custom indices. |
| **Data Type** | Can hold multiple data types in the same list (e.g., strings, integers, floats). | Preferably holds a single data type (e.g., all floats or all strings). Mixed types are possible but fall back to the generic `object` dtype. |
| **Operations** | Limited built-in operations (many tasks need loops or list comprehensions). | Optimized for mathematical and statistical operations, applied element-wise without explicit loops. |
| **Metadata** | No metadata; just raw data. | Carries metadata such as the index and data type (`dtype`). |
| **Size** | Stores references to separate Python objects, which adds per-element memory overhead. | Typically more memory-efficient for numerical data. |
| **Performance** | Slower for large-scale computations. | Faster for large-scale data operations due to vectorization and the underlying C implementation. |
| **Flexibility** | General-purpose collection. | Designed specifically for data manipulation and analysis. |

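The differences summarized in the table can be seen in a few lines. The sketch below reuses the lab's list; the `item_N` labels are made up purely for illustration.

```python
import pandas as pd

lst = [5.7, 75.2, 74.4, 84.0, 66.5, 66.3, 55.8, 75.7, 29.1, 43.7]
series = pd.Series(lst)

# With a plain list, element-wise arithmetic needs a loop or comprehension.
doubled_list = [x * 2 for x in lst]

# With a Series, the same operation is vectorized: no explicit loop.
doubled_series = series * 2

# A Series can carry custom labels (hypothetical labels, not part of the lab).
labels = [f"item_{i}" for i in range(len(lst))]
labeled = pd.Series(lst, index=labels)

print(doubled_series.head(3))
print(labeled["item_2"])   # look up a value by its label
print(series.dtype)        # metadata: the element type of the Series
```

Both approaches produce the same numbers; the Series version is shorter and usually scales better for large numeric data.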

### 3. Use indexing to return the third value in the Series above.

*Hint: Remember that indexing begins at 0.*

In [65]:
# your code here

# Use indexing to get the third value (index 2)
third_value = series[2]

# Print the third value
print(third_value)

74.4

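With the default index, the label `2` and the position `2` happen to coincide, so `series[2]` returns the third value. When the index is customized they can differ, which is why `.iloc` (position) and `.loc` (label) exist. A small sketch, where the index starting at 1 is an assumption made only to show the difference:

```python
import pandas as pd

lst = [5.7, 75.2, 74.4, 84.0, 66.5, 66.3, 55.8, 75.7, 29.1, 43.7]
series = pd.Series(lst)

# Default index: label-based and position-based lookups agree.
by_label = series.loc[2]
by_position = series.iloc[2]

# Hypothetical Series whose labels start at 1 instead of 0.
shifted = pd.Series(lst, index=range(1, 11))
third_by_position = shifted.iloc[2]  # third element, regardless of labels
value_at_label_2 = shifted.loc[2]    # element labelled 2, i.e. the second one

print(by_label, by_position, third_by_position, value_at_label_2)
```

Using `.iloc[2]` states the intent ("third element") explicitly and keeps working if the index changes later.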

### 4. Create a Pandas DataFrame from the list of lists below. Each sublist should be represented as a row.

In [8]:
# Given list of lists

b = [[53.1, 95.0, 67.5, 35.0, 78.4],
     [61.3, 40.8, 30.8, 37.8, 87.6],
     [20.6, 73.2, 44.2, 14.6, 91.8],
     [57.4, 0.1, 96.1, 4.2, 69.5],
     [83.6, 20.5, 85.4, 22.8, 35.9],
     [49.0, 69.0, 0.1, 31.8, 89.1],
     [23.3, 40.7, 95.0, 83.8, 26.9],
     [27.6, 26.4, 53.8, 88.8, 68.5],
     [96.6, 96.4, 53.4, 72.4, 50.1],
     [73.7, 39.0, 43.2, 81.6, 34.7]]

In [69]:
# your code here

# Create a DataFrame from the list of lists; each sublist becomes a row.
df = pd.DataFrame(b)

# Display the DataFrame
print(df)

      0     1     2     3     4
0  53.1  95.0  67.5  35.0  78.4
1  61.3  40.8  30.8  37.8  87.6
2  20.6  73.2  44.2  14.6  91.8
3  57.4   0.1  96.1   4.2  69.5
4  83.6  20.5  85.4  22.8  35.9
5  49.0  69.0   0.1  31.8  89.1
6  23.3  40.7  95.0  83.8  26.9
7  27.6  26.4  53.8  88.8  68.5
8  96.6  96.4  53.4  72.4  50.1
9  73.7  39.0  43.2  81.6  34.7


### 5. Rename the data frame columns based on the names in the list below.

In [75]:
# your code here

# Given list of column names
column_names = ["Score 1", "Score 2", "Score 3", "Score 4", "Score 5"]

# Rename the DataFrame columns using the provided list
df.columns = column_names

# Display the renamed DataFrame
print(df)


   Score 1  Score 2  Score 3  Score 4  Score 5
0     53.1     95.0     67.5     35.0     78.4
1     61.3     40.8     30.8     37.8     87.6
2     20.6     73.2     44.2     14.6     91.8
3     57.4      0.1     96.1      4.2     69.5
4     83.6     20.5     85.4     22.8     35.9
5     49.0     69.0      0.1     31.8     89.1
6     23.3     40.7     95.0     83.8     26.9
7     27.6     26.4     53.8     88.8     68.5
8     96.6     96.4     53.4     72.4     50.1
9     73.7     39.0     43.2     81.6     34.7

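Assigning to `df.columns` replaces every name at once and modifies the DataFrame in place. Two alternatives are sketched below on a shortened sample of the lab data: passing the names at construction time, and using `rename` with an explicit mapping, which returns a new DataFrame.

```python
import pandas as pd

# Shortened sample of the lab data (first two rows, first three columns).
rows = [[53.1, 95.0, 67.5],
        [61.3, 40.8, 30.8]]
names = ["Score 1", "Score 2", "Score 3"]

# Option 1: name the columns when the DataFrame is created.
df_a = pd.DataFrame(rows, columns=names)

# Option 2: rename via an old-name -> new-name mapping.
# dict(enumerate(names)) builds {0: "Score 1", 1: "Score 2", 2: "Score 3"}.
df_b = pd.DataFrame(rows).rename(columns=dict(enumerate(names)))

print(df_a)
print(df_b)
```

`rename` is handy when only some columns need new names, because unmapped columns are left untouched.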

### 6. Create a subset of this data frame that contains only the Score 1, 3, and 5 columns.

In [77]:
# your code here

# Create a subset of the DataFrame with the specified columns
subset_df = df[["Score 1", "Score 3", "Score 5"]]

# Display the subset DataFrame
print(subset_df)


   Score 1  Score 3  Score 5
0     53.1     67.5     78.4
1     61.3     30.8     87.6
2     20.6     44.2     91.8
3     57.4     96.1     69.5
4     83.6     85.4     35.9
5     49.0      0.1     89.1
6     23.3     95.0     26.9
7     27.6     53.8     68.5
8     96.6     53.4     50.1
9     73.7     43.2     34.7

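Selecting with a list of column names, as above, is one way to build the subset. The sketch below (on a two-row sample of the lab data) shows the equivalent `.loc` form and adds `.copy()`, which is a reasonable precaution if the subset will be modified afterwards.

```python
import pandas as pd

# Two-row sample of the lab data.
df = pd.DataFrame(
    [[53.1, 95.0, 67.5, 35.0, 78.4],
     [61.3, 40.8, 30.8, 37.8, 87.6]],
    columns=["Score 1", "Score 2", "Score 3", "Score 4", "Score 5"],
)
wanted = ["Score 1", "Score 3", "Score 5"]

# Equivalent selection with .loc: all rows, the listed columns.
subset_loc = df.loc[:, wanted]

# An explicit copy, so later edits cannot affect the original DataFrame.
subset_df = df[wanted].copy()
subset_df["Score 1"] = 0.0

print(df["Score 1"].tolist())  # original values are unchanged
print(subset_loc)
```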

### 7. From the original data frame, calculate the average Score 3 value.

In [79]:
# your code here

# Calculate the average of Score 3
average_score_3 = df["Score 3"].mean()

# Display the result
print("Average Score 3:", average_score_3)

Average Score 3: 56.95000000000001


### 8. From the original data frame, calculate the maximum Score 4 value.

In [81]:
# your code here

# Calculate the maximum of Score 4
max_score_4 = df["Score 4"].max()

# Display the result
print("Maximum Score 4:", max_score_4)


Maximum Score 4: 88.8


### 9. From the original data frame, calculate the median Score 2 value.

In [83]:
# your code here

# Calculate the median of Score 2
median_score_2 = df["Score 2"].median()

# Display the result
print("Median Score 2:", median_score_2)

Median Score 2: 40.75

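The three statistics from exercises 7 to 9 can also be computed in one call. This is a minimal sketch using `DataFrame.agg` with a column-to-function mapping; it rebuilds the lab's DataFrame so that it runs on its own.

```python
import pandas as pd

b = [[53.1, 95.0, 67.5, 35.0, 78.4],
     [61.3, 40.8, 30.8, 37.8, 87.6],
     [20.6, 73.2, 44.2, 14.6, 91.8],
     [57.4, 0.1, 96.1, 4.2, 69.5],
     [83.6, 20.5, 85.4, 22.8, 35.9],
     [49.0, 69.0, 0.1, 31.8, 89.1],
     [23.3, 40.7, 95.0, 83.8, 26.9],
     [27.6, 26.4, 53.8, 88.8, 68.5],
     [96.6, 96.4, 53.4, 72.4, 50.1],
     [73.7, 39.0, 43.2, 81.6, 34.7]]
df = pd.DataFrame(b, columns=["Score 1", "Score 2", "Score 3", "Score 4", "Score 5"])

# One statistic per column; the result is a Series indexed by column name.
summary = df.agg({"Score 3": "mean", "Score 4": "max", "Score 2": "median"})
print(summary.round(2))
```

Rounding the result avoids displaying floating-point noise such as `56.95000000000001`.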

### 10. Create a Pandas DataFrame from the dictionary of product orders below.

In [85]:
orders = {'Description': ['LUNCH BAG APPLE DESIGN',
  'SET OF 60 VINTAGE LEAF CAKE CASES ',
  'RIBBON REEL STRIPES DESIGN ',
  'WORLD WAR 2 GLIDERS ASSTD DESIGNS',
  'PLAYING CARDS JUBILEE UNION JACK',
  'POPCORN HOLDER',
  'BOX OF VINTAGE ALPHABET BLOCKS',
  'PARTY BUNTING',
  'JAZZ HEARTS ADDRESS BOOK',
  'SET OF 4 SANTA PLACE SETTINGS'],
 'Quantity': [1, 24, 1, 2880, 2, 7, 1, 4, 10, 48],
 'UnitPrice': [1.65, 0.55, 1.65, 0.18, 1.25, 0.85, 11.95, 4.95, 0.19, 1.25],
 'Revenue': [1.65, 13.2, 1.65, 518.4, 2.5, 5.95, 11.95, 19.8, 1.9, 60.0]}

In [87]:
# your code here

# Create a DataFrame from the dictionary
df_orders = pd.DataFrame(orders)

# Display the DataFrame
print(df_orders)

                          Description  Quantity  UnitPrice  Revenue
0              LUNCH BAG APPLE DESIGN         1       1.65     1.65
1  SET OF 60 VINTAGE LEAF CAKE CASES         24       0.55    13.20
2         RIBBON REEL STRIPES DESIGN          1       1.65     1.65
3   WORLD WAR 2 GLIDERS ASSTD DESIGNS      2880       0.18   518.40
4    PLAYING CARDS JUBILEE UNION JACK         2       1.25     2.50
5                      POPCORN HOLDER         7       0.85     5.95
6      BOX OF VINTAGE ALPHABET BLOCKS         1      11.95    11.95
7                       PARTY BUNTING         4       4.95    19.80
8            JAZZ HEARTS ADDRESS BOOK        10       0.19     1.90
9       SET OF 4 SANTA PLACE SETTINGS        48       1.25    60.00


### 11. Calculate the total quantity ordered and revenue generated from these orders.

In [91]:
# your code here

# Calculate total quantity and revenue
total_quantity = df_orders['Quantity'].sum()
total_revenue = df_orders['Revenue'].sum()

print(f"Total Quantity Ordered: {total_quantity}")
print(f"Total Revenue Generated: ${total_revenue:.2f}")

Total Quantity Ordered: 2978
Total Revenue Generated: $637.00

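Before trusting the totals, it can be worth checking that `Revenue` really equals `Quantity * UnitPrice` in every row. The sketch below does that with `numpy.isclose`, which tolerates floating-point rounding, and then sums both columns in a single call. Only the numeric columns of the lab's dictionary are reproduced.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "Quantity": [1, 24, 1, 2880, 2, 7, 1, 4, 10, 48],
    "UnitPrice": [1.65, 0.55, 1.65, 0.18, 1.25, 0.85, 11.95, 4.95, 0.19, 1.25],
    "Revenue": [1.65, 13.2, 1.65, 518.4, 2.5, 5.95, 11.95, 19.8, 1.9, 60.0],
})

# Recompute revenue and compare with the stored column, allowing for rounding.
recomputed = orders["Quantity"] * orders["UnitPrice"]
matches = np.isclose(recomputed, orders["Revenue"])
print("All rows consistent:", matches.all())

# Sum both columns at once; the result is a Series with one entry per column.
totals = orders[["Quantity", "Revenue"]].sum()
print(totals)
```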

### 12. Obtain the prices of the most expensive and least expensive items ordered and print the difference.

In [93]:
# your code here

# Most expensive and least expensive items
most_expensive_price = df_orders['UnitPrice'].max()
least_expensive_price = df_orders['UnitPrice'].min()

price_difference = most_expensive_price - least_expensive_price

print(f"Most Expensive Item Price: ${most_expensive_price:.2f}")
print(f"Least Expensive Item Price: ${least_expensive_price:.2f}")
print(f"Price Difference: ${price_difference:.2f}")

Most Expensive Item Price: $11.95
Least Expensive Item Price: $0.18
Price Difference: $11.77

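`max()` and `min()` return only the prices. To see which items they belong to, `idxmax()` and `idxmin()` return the row labels, which can then be passed to `.loc`. The sketch below uses a four-row sample of the orders that includes both extremes.

```python
import pandas as pd

# Four-row sample of the lab's orders, including the priciest and cheapest items.
orders = pd.DataFrame({
    "Description": ["LUNCH BAG APPLE DESIGN",
                    "WORLD WAR 2 GLIDERS ASSTD DESIGNS",
                    "BOX OF VINTAGE ALPHABET BLOCKS",
                    "JAZZ HEARTS ADDRESS BOOK"],
    "UnitPrice": [1.65, 0.18, 11.95, 0.19],
})

# Row labels of the extreme prices, then the full rows.
priciest = orders.loc[orders["UnitPrice"].idxmax()]
cheapest = orders.loc[orders["UnitPrice"].idxmin()]

print(priciest["Description"], priciest["UnitPrice"])
print(cheapest["Description"], cheapest["UnitPrice"])
print(f"Difference: {priciest['UnitPrice'] - cheapest['UnitPrice']:.2f}")
```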

## Let's load another dataset for more exercises

In [95]:
# Run this code:
admissions = pd.read_csv('../Admission_Predict.csv')

Let's inspect the first rows of the dataset using the `head` function.

In [99]:
# your code here
print(admissions.head())

   Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR   CGPA  \
0           1        337          118                  4  4.5   4.5  9.65   
1           2        316          104                  3  3.0   3.5  8.00   
2           3        322          110                  3  3.5   2.5  8.67   
3           4        314          103                  2  2.0   3.0  8.21   
4           5        330          115                  5  4.5   3.0  9.34   

   Research  Chance of Admit   
0         1              0.92  
1         1              0.72  
2         1              0.80  
3         0              0.65  
4         1              0.90  


### 1 - Before working with the graduate admissions data, we will verify that there is no missing data in the dataset. Do this in the cell below.

In [101]:
# your code here

# Check for missing data
missing_data = admissions.isnull().sum()

# Display the number of missing values in each column
print(missing_data)


Serial No.           0
GRE Score            0
TOEFL Score          0
University Rating    0
SOP                  0
LOR                  0
CGPA                 0
Research             0
Chance of Admit      0
dtype: int64

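Since the admissions data has no missing values, the output above is all zeros. To show what the same check looks like when values are missing, here is a sketch on a small made-up DataFrame (the values are hypothetical; only the column names are borrowed from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values.
toy = pd.DataFrame({
    "GRE Score": [337, np.nan, 322],
    "CGPA": [9.65, 8.00, np.nan],
    "Research": [1, 1, 0],
})

# Missing values per column.
per_column = toy.isnull().sum()

# Single yes/no answer for the whole DataFrame.
has_missing = bool(toy.isnull().values.any())

# Rows with no missing values at all.
complete_rows = toy.dropna()

print(per_column)
print("Any missing values?", has_missing)
print(complete_rows)
```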

### 2 - Interestingly, the serial number column uniquely identifies each applicant. Instead of keeping the default index, we should make this column our index. Do this in the cell below. Keep the column in the dataframe in addition to making it an index.

In [103]:
# your code here

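# Note: set_index drops the column by default; pass drop=False to also keep
# 'Serial No.' as a regular column, as the exercise asks.
admissions.set_index('Serial No.', inplace=True)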
print(admissions.head())

            GRE Score  TOEFL Score  University Rating  SOP  LOR   CGPA  \
Serial No.                                                               
1                 337          118                  4  4.5   4.5  9.65   
2                 316          104                  3  3.0   3.5  8.00   
3                 322          110                  3  3.5   2.5  8.67   
4                 314          103                  2  2.0   3.0  8.21   
5                 330          115                  5  4.5   3.0  9.34   

            Research  Chance of Admit   
Serial No.                              
1                  1              0.92  
2                  1              0.72  
3                  1              0.80  
4                  0              0.65  
5                  1              0.90  

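In the output above, `Serial No.` appears only as the index, because `set_index` removes the column by default. The exercise asks to keep it as a column as well, which the `drop=False` argument does. A minimal sketch on a made-up three-row DataFrame:

```python
import pandas as pd

# Hypothetical toy data using two of the dataset's column names.
toy = pd.DataFrame({
    "Serial No.": [1, 2, 3],
    "GRE Score": [337, 316, 322],
})

# Default behaviour: the column moves into the index and is dropped.
dropped = toy.set_index("Serial No.")

# drop=False: the column becomes the index and also stays as a column.
kept = toy.set_index("Serial No.", drop=False)

print(dropped.columns.tolist())
print(kept.columns.tolist())
```

Returning a new DataFrame instead of using `inplace=True` also leaves the original available for comparison.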

It turns out that `GRE Score` and `CGPA` together also uniquely identify the rows. Show this in the cell below.

In [122]:
# Check whether any (GRE Score, CGPA) pair appears more than once

# your code here

is_unique = admissions.duplicated(subset=['GRE Score', 'CGPA']).sum() == 0
print(f"Do GRE Score and CGPA uniquely identify the data? {is_unique}")

Do GRE Score and CGPA uniquely identify the data? True

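To see how the `duplicated(subset=...)` check behaves when a pair does repeat, here is a sketch on made-up rows. A `groupby(...).size()` count is shown as a second way to reach the same conclusion.

```python
import pandas as pd

# Hypothetical toy data: the first and last rows share the same pair.
toy = pd.DataFrame({
    "GRE Score": [337, 316, 316, 337],
    "CGPA": [9.65, 8.00, 8.21, 9.65],
})

# duplicated() marks every repeat of an earlier (GRE Score, CGPA) pair.
dupes = toy.duplicated(subset=["GRE Score", "CGPA"])
is_unique = not dupes.any()

# Alternative: count rows per pair; any count above 1 means a repeat.
pair_counts = toy.groupby(["GRE Score", "CGPA"]).size()

print(dupes.tolist())
print("Unique pairs?", is_unique)
print(pair_counts)
```

Note that rows 1 and 2 share a GRE score but differ in CGPA, so they are not duplicates: the check applies to the combination of both columns.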

### 3 - In this part of the lab, we would like to test complex conditions on the entire data set at once. Let's start by finding the number of rows where the CGPA is greater than 9 and the student has research experience.

In [126]:
# Filter rows where CGPA > 9 and Research == 1
cgpa_research_filter = (admissions['CGPA'] > 9) & (admissions['Research'] == 1)

# Display filtered rows for clarity
filtered_rows = admissions[cgpa_research_filter]
print("Rows where CGPA > 9 and Research == 1:")
print(filtered_rows)

# Count the number of rows matching the condition
count = filtered_rows.shape[0]
print(f"\nNumber of rows where CGPA > 9 and Research == 1: {count}")


Rows where CGPA > 9 and Research == 1:
            GRE Score  TOEFL Score  University Rating  SOP  LOR   CGPA  \
Serial No.                                                               
1                 337          118                  4  4.5   4.5  9.65   
5                 330          115                  5  4.5   3.0  9.34   
11                328          112                  4  4.0   4.5  9.10   
20                328          116                  5  5.0   5.0  9.50   
21                334          119                  5  5.0   4.5  9.70   
...               ...          ...                ...  ...   ...   ...   
380               329          111                  4  4.5   4.0  9.23   
381               324          110                  3  3.5   3.5  9.04   
382               325          107                  3  3.0   3.5  9.11   
383               330          116                  4  5.0   4.5  9.45   
385               333          117                  4  5.0   4.0  9.66   

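When only the count is needed, the filtered rows do not have to be printed. A boolean mask can be summed directly, because `True` counts as 1. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical toy data using two of the dataset's column names.
toy = pd.DataFrame({
    "CGPA": [9.65, 8.00, 9.34, 9.10],
    "Research": [1, 1, 1, 0],
})

# Each condition goes in parentheses; & combines them element-wise.
mask = (toy["CGPA"] > 9) & (toy["Research"] == 1)

count_from_mask = int(mask.sum())        # count the True values
count_from_shape = toy[mask].shape[0]    # same count via the filtered rows

print(count_from_mask, count_from_shape)
```

The parentheses matter: `&` binds more tightly than `>` and `==`, so omitting them changes the meaning of the expression.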
### 4 - Now return all the rows where the CGPA is greater than 9 and the SOP score is less than 3.5. Find the mean chance of admit for these applicants.

In [124]:
# your code here

# Filter rows where CGPA > 9 and SOP < 3.5
filtered_rows = admissions[(admissions['CGPA'] > 9) & (admissions['SOP'] < 3.5)]

# Calculate the mean Chance of Admit
mean_chance = filtered_rows['Chance of Admit '].mean()

print("Rows where CGPA > 9 and SOP < 3.5:")
print(filtered_rows)
print(f"\nMean chance of admit for these applicants: {mean_chance:.2f}")


Rows where CGPA > 9 and SOP < 3.5:
            GRE Score  TOEFL Score  University Rating  SOP  LOR   CGPA  \
Serial No.                                                               
29                338          118                  4  3.0   4.5  9.40   
63                327          114                  3  3.0   3.0  9.02   
141               326          114                  3  3.0   3.0  9.11   
218               324          111                  4  3.0   3.0  9.01   
382               325          107                  3  3.0   3.5  9.11   

            Research  Chance of Admit   
Serial No.                              
29                 1              0.91  
63                 0              0.61  
141                1              0.83  
218                1              0.82  
382                1              0.84  

Mean chance of admit for these applicants: 0.80

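The column name used above, `'Chance of Admit '`, ends with a space, which is easy to mistype. One option is to strip whitespace from the column names first; the same filter can then also be written with `query`. The sketch below uses made-up rows and assumes that renaming the columns is acceptable for the analysis.

```python
import pandas as pd

# Hypothetical toy data; note the trailing space in the last column name.
toy = pd.DataFrame({
    "CGPA": [9.40, 9.02, 8.50, 9.30],
    "SOP": [3.0, 3.0, 2.5, 4.5],
    "Chance of Admit ": [0.91, 0.61, 0.55, 0.88],
})

# Remove leading/trailing whitespace from every column name.
toy.columns = toy.columns.str.strip()

# Boolean-mask filter, as in the exercise.
selected = toy[(toy["CGPA"] > 9) & (toy["SOP"] < 3.5)]

# Equivalent filter written as a query string.
same_rows = toy.query("CGPA > 9 and SOP < 3.5")

mean_chance = selected["Chance of Admit"].mean()
print(selected)
print(f"Mean chance of admit: {mean_chance:.2f}")
```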