# Name: Iman Noor
## Submission Date: 28-06-2024

# **Data manipulation with Pandas (indexing, selection, grouping)**

1. **Indexing and Selection**:
   - **Indexing with []**: Access columns of a DataFrame.
   - **.loc[]**: Label-based indexing for selecting rows and columns.
   - **.iloc[]**: Positional indexing for selecting rows and columns.
   - **Boolean Indexing**: Filter rows based on conditions using boolean arrays.

2. **Grouping and Aggregation**:
   - **.groupby()**: Group data based on one or more columns.
   - **Aggregation functions**: Compute summary statistics (mean, sum, count) for each group.

3. **Data Manipulation Techniques**:
   - **.rename()**: Rename columns or index labels.
   - **.drop()**: Drop columns or rows from a DataFrame.
   - **.sort_values()**: Sort data based on one or more columns.
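
The selection styles listed above can be compared side by side on a tiny example. This is a minimal sketch; the `toy` DataFrame and its values are made up for illustration and are not part of the datasets used in the exercises.

```python
import pandas as pd

# Hypothetical toy data, used only to contrast the selection styles.
toy = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "score": [10, 25, 7, 40]},
    index=[101, 102, 103, 104],
)

col = toy["score"]                # [] selects a column and returns a Series
by_label = toy.loc[102, "score"]  # .loc looks up by index label and column label
by_pos = toy.iloc[1, 1]           # .iloc looks up by integer position (row 1, column 1)
high = toy[toy["score"] > 20]     # a boolean mask keeps the rows where it is True

print(by_label, by_pos)
print(high)
```

Here `.loc[102, "score"]` and `.iloc[1, 1]` reach the same cell, once by label and once by position.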

# **Practice Exercise**

## **Importing library**

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("diabetes.csv")
df.head() 

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## *By default, head() returns the first five rows*

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [4]:
df.shape

(768, 9)

In [5]:
df.describe() # summary statistics for each numeric column

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [6]:
df.values

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

## *Sorting by 'Glucose'*

In [7]:
df_s = df.sort_values('Glucose')
df_s

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
75,1,0,48,20,0,24.7,0.140,22,0
502,6,0,68,41,0,39.0,0.727,41,1
349,5,0,80,32,0,41.0,0.346,37,1
342,1,0,68,35,0,32.0,0.389,22,0
182,1,0,74,20,23,27.7,0.299,21,0
...,...,...,...,...,...,...,...,...,...
228,4,197,70,39,744,36.7,2.329,31,0
408,8,197,74,0,0,25.9,1.191,39,1
8,2,197,70,45,543,30.5,0.158,53,1
561,0,198,66,32,274,41.3,0.502,28,1
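
`sort_values()` sorts in ascending order by default. The sketch below shows a descending sort and a sort on two columns, using a small made-up DataFrame (the values are illustrative only).

```python
import pandas as pd

# Hypothetical toy data with a tie in 'Glucose'.
toy = pd.DataFrame({"Glucose": [148, 85, 183, 85], "Age": [50, 31, 32, 21]})

# Highest glucose first.
desc = toy.sort_values("Glucose", ascending=False)

# Sort by 'Glucose' ascending, then break ties by 'Age' descending.
multi = toy.sort_values(["Glucose", "Age"], ascending=[True, False])

print(desc)
print(multi)
```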


## *subsetting columns*

In [8]:
# select age and blood pressure columns
age_bp = df[['Age', 'BloodPressure']]
age_bp.head()

,Age,BloodPressure
0,50,72
1,31,66
2,32,64
3,21,66
4,33,40


## *subsetting rows*

In [9]:
# filter for rows where age is greater than 50
df_age = df[df['Age']>50]
print("Total patients with age greater than 50: ",df_age['Age'].count())
df_age['Age']

Total patients with age greater than 50:  81


8      53
9      54
12     57
13     59
14     51
       ..
719    52
734    53
757    52
759    66
763    63
Name: Age, Length: 81, dtype: int64

In [10]:
print("Mean: ",df['Insulin'].mean())
print("Median: ",df['Insulin'].median())

Mean:  79.79947916666667
Median:  30.5


## *Checking for duplicates*

In [11]:
duplicate = df[df.duplicated()]
duplicate

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome


### *The result is empty, so the DataFrame has no duplicate rows*
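
For comparison, here is a minimal sketch of what the same check looks like when duplicates do exist, and how they could be removed. The `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: the second row repeats the first.
toy = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})

mask = toy.duplicated()          # True for every row that repeats an earlier row
n_dupes = int(mask.sum())        # number of duplicate rows
deduped = toy.drop_duplicates()  # returns a new DataFrame with the first occurrence kept

print(n_dupes)
print(deduped)
```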

In [12]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

## **Grouping**

In [13]:
# group by 'Outcome' and calculate the mean of each column
summary_stats = df.groupby('Outcome').mean()
summary_stats

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


In [15]:
# group by 'Age' and 'Outcome' and count the number of occurrences
age_outcome_count = df.groupby(['Age', 'Outcome']).size().reset_index(name='Count')
age_outcome_count

,Age,Outcome,Count
0,21,0,58
1,21,1,5
2,22,0,61
3,22,1,11
4,23,0,31
...,...,...,...
91,68,0,1
92,69,0,2
93,70,1,1
94,72,0,1
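
The long table of counts can also be reshaped so that each outcome becomes its own column, which may be easier to read. A minimal sketch on made-up data, using `unstack()` on the result of `size()`:

```python
import pandas as pd

# Hypothetical toy data with the same two grouping columns.
toy = pd.DataFrame({"Age": [21, 21, 21, 22, 22], "Outcome": [0, 0, 1, 0, 1]})

# size() returns a Series indexed by (Age, Outcome).
counts = toy.groupby(["Age", "Outcome"]).size()

# unstack() moves the inner index level (Outcome) into the columns.
wide = counts.unstack(fill_value=0)

print(wide)
```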


In [16]:
# group by 'BMI' and 'Outcome' and calculate mean of 'Glucose' and 'Insulin'
bmi_outcome_stats = df.groupby(['BMI', 'Outcome'])[['Glucose', 'Insulin']].mean().reset_index()
bmi_outcome_stats

,BMI,Outcome,Glucose,Insulin
0,0.0,0,100.777778,9.888889
1,0.0,1,120.000000,0.000000
2,18.2,0,92.333333,27.333333
3,18.4,0,104.000000,0.000000
4,19.1,0,80.000000,0.000000
...,...,...,...,...
353,53.2,1,162.000000,100.000000
354,55.0,1,88.000000,99.000000
355,57.3,0,123.000000,240.000000
356,59.4,1,180.000000,14.000000
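
Because BMI is a continuous measurement, grouping on its raw values produces one group per distinct BMI and outcome, many of them holding a single patient. One possible alternative is to bin BMI into bands first with `pd.cut`. The sketch below uses made-up data, and the bin edges and labels are assumptions chosen for illustration, not values taken from this notebook.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "BMI": [18.2, 23.3, 26.6, 33.6, 43.1],
    "Outcome": [0, 1, 0, 1, 1],
    "Glucose": [92, 183, 85, 148, 137],
})

# Assumed bin edges and labels; intervals are closed on the right by default.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["under", "normal", "over", "obese"]
toy["BMI_band"] = pd.cut(toy["BMI"], bins=bins, labels=labels)

# observed=True keeps only the bands that actually occur in the data.
band_means = toy.groupby("BMI_band", observed=True)["Glucose"].mean()

print(band_means)
```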


# **Task 13: Data manipulation with Pandas (indexing, selection, grouping)**

## **Q. Load a DataFrame from a CSV file. Display the first and last five rows of the DataFrame.**

In [17]:
std_performance = pd.read_csv('Student_performance_data _.csv')
std_performance.head() # by default 1st five rows

,StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
0,1001,17,1,0,2,19.833723,7,1,2,0,0,1,0,2.929196,2.0
1,1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0
2,1003,15,0,2,3,4.21057,26,0,2,0,0,0,0,0.112602,4.0
3,1004,17,1,0,3,10.028829,14,0,3,1,0,0,0,2.054218,3.0
4,1005,17,1,0,2,4.672495,17,1,3,0,0,0,0,1.288061,4.0


In [18]:
std_performance.tail()

,StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
2387,3388,18,1,0,3,10.680555,2,0,4,1,0,0,0,3.455509,0.0
2388,3389,17,0,0,1,7.583217,4,1,4,0,1,0,0,3.27915,4.0
2389,3390,16,1,0,2,6.8055,20,0,2,0,0,0,1,1.142333,2.0
2390,3391,16,1,1,0,12.416653,17,0,2,0,1,1,0,1.803297,1.0
2391,3392,16,1,0,2,17.819907,13,0,2,0,0,0,1,2.140014,1.0


## *To display seven rows, pass the number of rows to head()*


In [19]:
std_performance.head(7)

,StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
0,1001,17,1,0,2,19.833723,7,1,2,0,0,1,0,2.929196,2.0
1,1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0
2,1003,15,0,2,3,4.21057,26,0,2,0,0,0,0,0.112602,4.0
3,1004,17,1,0,3,10.028829,14,0,3,1,0,0,0,2.054218,3.0
4,1005,17,1,0,2,4.672495,17,1,3,0,0,0,0,1.288061,4.0
5,1006,18,0,0,1,8.191219,0,0,1,1,0,0,0,3.084184,1.0
6,1007,15,0,1,1,15.60168,10,0,3,0,1,0,0,2.748237,2.0


## **Q. Set a specific column as the index of the DataFrame.**

In [20]:
std_performance_n = std_performance.set_index('StudentID')
std_performance_n.head()

StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
1001,17,1,0,2,19.833723,7,1,2,0,0,1,0,2.929196,2.0
1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0
1003,15,0,2,3,4.21057,26,0,2,0,0,0,0,0.112602,4.0
1004,17,1,0,3,10.028829,14,0,3,1,0,0,0,2.054218,3.0
1005,17,1,0,2,4.672495,17,1,3,0,0,0,0,1.288061,4.0


## **Q. Select a specific column and display its values.**

In [21]:
study_time_weekly_table = std_performance_n[['StudyTimeWeekly']]
study_time_weekly_table.head()

StudentID,StudyTimeWeekly
1001,19.833723
1002,15.408756
1003,4.21057
1004,10.028829
1005,4.672495


## **Q. Select multiple columns and display the resulting DataFrame.**

In [22]:
std_performance_n.columns

Index(['Age', 'Gender', 'Ethnicity', 'ParentalEducation', 'StudyTimeWeekly',
       'Absences', 'Tutoring', 'ParentalSupport', 'Extracurricular', 'Sports',
       'Music', 'Volunteering', 'GPA', 'GradeClass'],
      dtype='object')

In [23]:
std_performance_cols = std_performance_n[['Age', 'Gender', 'GPA', 'GradeClass']]
std_performance_cols.head(10)

StudentID,Age,Gender,GPA,GradeClass
1001,17,1,2.929196,2.0
1002,18,0,3.042915,1.0
1003,15,0,0.112602,4.0
1004,17,1,2.054218,3.0
1005,17,1,1.288061,4.0
1006,18,0,3.084184,1.0
1007,15,0,2.748237,2.0
1008,15,1,1.360143,4.0
1009,17,0,2.896819,2.0
1010,16,1,3.573474,0.0


## *Alternatively, you can assign all your columns to a list variable and pass that variable to the indexing operator.*

In [24]:
cols = ['Age', 'Gender', 'GPA', 'GradeClass']
std_list = std_performance_n[cols]
std_list.head()

StudentID,Age,Gender,GPA,GradeClass
1001,17,1,2.929196,2.0
1002,18,0,3.042915,1.0
1003,15,0,0.112602,4.0
1004,17,1,2.054218,3.0
1005,17,1,1.288061,4.0


## **Q. Select a subset of rows using the .loc method.**

In [25]:
std_performance_n.loc[1003] # select the row labelled 1003 (the 3rd row) by its index label

Age                  15.000000
Gender                0.000000
Ethnicity             2.000000
ParentalEducation     3.000000
StudyTimeWeekly       4.210570
Absences             26.000000
Tutoring              0.000000
ParentalSupport       2.000000
Extracurricular       0.000000
Sports                0.000000
Music                 0.000000
Volunteering          0.000000
GPA                   0.112602
GradeClass            4.000000
Name: 1003, dtype: float64
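
Once `StudentID` is the index, row labels and row positions no longer coincide, so it helps to see `.loc` and `.iloc` next to each other. A minimal sketch on made-up data with the same kind of index:

```python
import pandas as pd

# Hypothetical toy data indexed by a StudentID-like label.
toy = pd.DataFrame(
    {"GPA": [2.9, 3.0, 0.1]},
    index=pd.Index([1001, 1002, 1003], name="StudentID"),
)

row_by_label = toy.loc[1003]  # label-based: the row whose index label is 1003
row_by_pos = toy.iloc[2]      # position-based: the third row

label_slice = toy.loc[1001:1002]  # label slices include both endpoints
pos_slice = toy.iloc[0:2]         # position slices exclude the end position

print(row_by_label.equals(row_by_pos))
print(label_slice)
```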

## *Select subset of rows where StudyTimeWeekly is greater than 10*

In [26]:
subset = std_performance_n.loc[std_performance_n['StudyTimeWeekly']>10, ['GPA', 'StudyTimeWeekly']]
subset.head()

StudentID,GPA,StudyTimeWeekly
1001,2.929196,19.833723
1002,3.042915,15.408756
1004,2.054218,10.028829
1007,2.748237,15.60168
1008,1.360143,15.424496


## **Q. Select a subset of rows and columns using the .iloc method.**

In [27]:
rc_subset = std_performance_n.iloc[[1, 4, 7]]
rc_subset.head()

StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0
1005,17,1,0,2,4.672495,17,1,3,0,0,0,0,1.288061,4.0
1008,15,1,1,4,15.424496,22,1,1,1,0,0,0,1.360143,4.0
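
The cell above selects rows only, so every column is returned. To restrict the columns as well, `.iloc` accepts a second positional selector. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Age": [17, 18, 15, 17],
    "Gender": [1, 0, 0, 1],
    "GPA": [2.9, 3.0, 0.1, 2.1],
})

# Rows at positions 1 and 3, columns at positions 0 and 2.
sub = toy.iloc[[1, 3], [0, 2]]

# Slices work too: the first two rows and the last two columns.
block = toy.iloc[0:2, 1:3]

print(sub)
print(block)
```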


## **Q. Filter rows based on a condition.**

## *Use of filter(): it selects columns whose labels contain 'G'; it does not filter rows by value*

In [28]:
std_performance_n.filter(like='G').head()

StudentID,Gender,GPA,GradeClass
1001,1,2.929196,2.0
1002,0,3.042915,1.0
1003,0,0.112602,4.0
1004,1,2.054218,3.0
1005,1,1.288061,4.0
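
`DataFrame.filter()` works on labels, while a boolean mask works on values. The sketch below contrasts the two on made-up data.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Gender": [1, 0], "GPA": [2.9, 3.0], "Age": [17, 18]})

# Label-based: keep columns whose name contains an upper-case 'G'.
# 'Age' is excluded because the match is case-sensitive.
like_g = toy.filter(like="G")

# Label-based with a regular expression: names that start with 'G'.
starts_g = toy.filter(regex="^G")

# Value-based: keep rows that satisfy a condition.
rows = toy[toy["GPA"] > 2.95]

print(list(like_g.columns))
print(rows)
```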


## *Select subset of rows where 'GPA' is greater than 3.000000*

In [29]:
subset_gpa = std_performance_n.loc[std_performance_n['GPA'] > 3.0] # omitting the column list keeps all columns
subset_gpa.head()

StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0
1006,18,0,0,1,8.191219,0,0,1,1,0,0,0,3.084184,1.0
1010,16,1,0,1,18.444466,0,0,3,1,0,0,0,3.573474,0.0
1039,15,1,1,1,2.949078,3,1,1,1,1,0,0,3.018906,1.0
1045,18,1,0,1,18.921512,1,1,3,1,1,0,0,4.0,0.0


In [30]:
s_gpa = std_performance_n.loc[std_performance_n['GPA']>3.000000, ['Age', 'StudyTimeWeekly', 'GPA','GradeClass']]
s_gpa.head()

StudentID,Age,StudyTimeWeekly,GPA,GradeClass
1002,18,15.408756,3.042915,1.0
1006,18,8.191219,3.084184,1.0
1010,16,18.444466,3.573474,0.0
1039,15,2.949078,3.018906,1.0
1045,18,18.921512,4.0,0.0


## **Q. Group the DataFrame by a specific column and calculate the mean of each group.**

## *Grouping by 'GradeClass' and calculating the mean of each group*

In [31]:
grouped_mean = std_performance_n.groupby('GradeClass').mean(numeric_only=True)
grouped_mean.head()

GradeClass,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA
0.0,16.476636,0.457944,0.981308,1.700935,11.854926,5.747664,0.485981,2.682243,0.514019,0.35514,0.196262,0.140187,3.102942
1.0,16.460967,0.509294,0.869888,1.698885,11.122335,5.312268,0.405204,2.3829,0.420074,0.342007,0.260223,0.163569,3.001673
2.0,16.508951,0.496164,0.882353,1.662404,10.106404,7.250639,0.30179,2.104859,0.414322,0.28133,0.186701,0.148338,2.659742
3.0,16.449275,0.514493,0.94686,1.751208,9.757963,11.427536,0.272947,2.161836,0.359903,0.304348,0.195652,0.144928,2.215545
4.0,16.463254,0.519405,0.844756,1.786127,9.184822,20.786953,0.271676,2.006606,0.361685,0.297275,0.186623,0.164327,1.208041


## *The means of Age, StudyTimeWeekly, Absences and GPA are more meaningful than the means of the coded categorical columns*

In [32]:
grouped_mean_s = std_performance_n.groupby(['GradeClass'])[['Age', 'StudyTimeWeekly', 'Absences', 'GPA']].mean().reset_index()
grouped_mean_s.head()

,GradeClass,Age,StudyTimeWeekly,Absences,GPA
0,0.0,16.476636,11.854926,5.747664,3.102942
1,1.0,16.460967,11.122335,5.312268,3.001673
2,2.0,16.508951,10.106404,7.250639,2.659742
3,3.0,16.449275,9.757963,11.427536,2.215545
4,4.0,16.463254,9.184822,20.786953,1.208041


## **Q. Group the DataFrame by multiple columns and calculate the sum of each group.**

In [33]:
grouped_sum = std_performance_n.groupby(['Gender', 'GradeClass'])[['Age', 'StudyTimeWeekly', 'Absences', 'GPA']].sum().reset_index()
grouped_sum

,Gender,GradeClass,Age,StudyTimeWeekly,Absences,GPA
0,0,0.0,949,674.279053,461,164.665604
1,0,1.0,2167,1463.455158,692,399.175528
2,0,2.0,3221,1928.888212,1360,527.675733
3,0,3.0,3314,1943.073464,2231,448.813108
4,0,4.0,9557,5346.030446,12052,704.524334
5,1,0.0,814,594.198022,154,167.349166
6,1,1.0,2261,1528.453023,737,408.274518
7,1,2.0,3234,2022.715704,1475,512.283349
8,1,3.0,3496,2096.723071,2500,468.422482
9,1,4.0,10380,5776.788518,13121,758.413813


In [34]:
grouped_sum_2 = std_performance_n.groupby(['Gender', 'GradeClass']).sum(numeric_only=True)
grouped_sum_2

Gender,GradeClass,Age,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA
0,0.0,949,55,97,674.279053,461,31,150,25,18,15,9,164.665604
0,1.0,2167,108,225,1463.455158,692,60,322,58,47,31,23,399.175528
0,2.0,3221,181,331,1928.888212,1360,60,416,79,55,34,37,527.675733
0,3.0,3314,178,345,1943.073464,2231,56,434,73,68,40,25,448.813108
0,4.0,9557,485,1037,5346.030446,12052,163,1150,217,172,107,90,704.524334
1,0.0,814,50,85,594.198022,154,21,137,30,20,6,6,167.349166
1,1.0,2261,126,232,1528.453023,737,49,319,55,45,39,21,408.274518
1,2.0,3234,164,319,2022.715704,1475,58,407,83,55,39,21,512.283349
1,3.0,3496,214,380,2096.723071,2500,57,461,76,58,41,35,468.422482
1,4.0,10380,538,1126,5776.788518,13121,166,1280,221,188,119,109,758.413813


## **Q. Use the agg method to apply multiple aggregation functions to grouped data.**

In [35]:
grp_agg = std_performance_n.groupby(['Gender', 'GradeClass']).agg({
    'Age': ['mean','sum','max','min'],
    'StudyTimeWeekly': ['mean','sum','max','min'],
    'Absences': ['mean','sum','max','min'],
    'GPA': ['mean','sum','max','min']
})
grp_agg

,,Age,Age,Age,Age,StudyTimeWeekly,StudyTimeWeekly,StudyTimeWeekly,StudyTimeWeekly,Absences,Absences,Absences,Absences,GPA,GPA,GPA,GPA
Gender,GradeClass,mean,sum,max,min,mean,sum,max,min,mean,sum,max,min,mean,sum,max,min
0,0.0,16.362069,949,18,15,11.625501,674.279053,19.88576,0.18505,7.948276,461,28,0,2.839062,164.665604,4.0,0.390756
0,1.0,16.416667,2167,18,15,11.086781,1463.455158,19.912084,0.393664,5.242424,692,25,0,3.024057,399.175528,3.498257,0.62112
0,2.0,16.350254,3221,18,15,9.791311,1928.888212,19.920256,0.001057,6.903553,1360,28,0,2.678557,527.675733,3.597766,0.557549
0,3.0,16.487562,3314,18,15,9.667032,1943.073464,19.710483,0.10698,11.099502,2231,29,0,2.232901,448.813108,4.0,0.528226
0,4.0,16.420962,9557,18,15,9.185619,5346.030446,19.933666,0.008031,20.707904,12052,29,0,1.210523,704.524334,3.482453,0.0
1,0.0,16.612245,814,18,15,12.12649,594.198022,19.765242,1.51109,3.142857,154,29,0,3.415289,167.349166,4.0,0.21457
1,1.0,16.50365,2261,18,15,11.156591,1528.453023,19.972346,0.018117,5.379562,737,29,0,2.980106,408.274518,3.572945,0.0
1,2.0,16.670103,3234,18,15,10.42637,2022.715704,19.833723,0.227793,7.603093,1475,28,0,2.640636,512.283349,3.262922,0.818126
1,3.0,16.413146,3496,18,15,9.84377,2096.723071,19.93981,0.087192,11.737089,2500,27,0,2.199167,468.422482,3.275903,0.0
1,4.0,16.502385,10380,18,15,9.184083,5776.788518,19.978094,0.004859,20.860095,13121,29,0,1.205745,758.413813,3.979421,0.0
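
Passing a dictionary of lists to `agg()` produces two-level column headers, as seen above. If flat column names are preferred, named aggregation is one option. A minimal sketch on made-up data; the output column names (`gpa_mean` and so on) are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Gender": [0, 0, 1, 1],
    "GPA": [2.0, 3.0, 1.0, 4.0],
    "Absences": [5, 7, 0, 10],
})

# Each keyword is an output column name; each value is (input column, function).
summary = toy.groupby("Gender").agg(
    gpa_mean=("GPA", "mean"),
    gpa_max=("GPA", "max"),
    absences_total=("Absences", "sum"),
)

print(summary)
```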


## **Q. Calculate the size of each group.**

In [36]:
grp_size = std_performance_n.groupby(['Age', 'Gender', 'Ethnicity', 'ParentalEducation', 'StudyTimeWeekly',
       'Absences', 'Tutoring', 'ParentalSupport', 'Extracurricular', 'Sports',
       'Music', 'Volunteering', 'GPA', 'GradeClass']).size().reset_index(name='GroupSize')
grp_size

,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass,GroupSize
0,15,0,0,0,0.451394,9,0,1,0,0,0,0,1.756186,4.0,1
1,15,0,0,0,3.206181,10,0,1,1,1,0,1,2.220051,2.0,1
2,15,0,0,0,4.225258,15,1,3,0,0,0,0,1.799531,4.0,1
3,15,0,0,0,4.430021,15,0,2,0,1,0,0,1.441410,4.0,1
4,15,0,0,0,6.013113,22,0,3,0,0,0,0,1.028184,1.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2387,18,1,3,3,13.414944,10,0,3,0,1,0,0,2.644194,2.0,1
2388,18,1,3,3,16.086407,20,0,1,0,0,1,0,1.287667,3.0,1
2389,18,1,3,4,5.758682,4,0,2,1,1,0,1,3.127453,1.0,1
2390,18,1,3,4,16.564255,21,0,2,0,1,1,0,1.729073,4.0,1
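
Grouping by every column gives one group per distinct row, which is why the sizes shown above are all 1. Group sizes tend to be more informative when computed for one or two categorical columns. The sketch below uses made-up data and also shows how `size()` differs from `count()` when values are missing.

```python
import pandas as pd

# Hypothetical toy data with one missing GPA.
toy = pd.DataFrame({
    "GradeClass": [0.0, 1.0, 1.0, 4.0, 4.0, 4.0],
    "GPA": [3.5, 3.1, None, 1.2, 0.9, 1.0],
})

sizes = toy.groupby("GradeClass").size()           # rows per group, missing values included
counts = toy.groupby("GradeClass")["GPA"].count()  # non-missing GPA values per group

print(sizes)
print(counts)
```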


## **Q. Select rows based on multiple conditions.**

In [37]:
std_performance_n[std_performance_n['Age']<35].head()

StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
1001,17,1,0,2,19.833723,7,1,2,0,0,1,0,2.929196,2.0
1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0
1003,15,0,2,3,4.21057,26,0,2,0,0,0,0,0.112602,4.0
1004,17,1,0,3,10.028829,14,0,3,1,0,0,0,2.054218,3.0
1005,17,1,0,2,4.672495,17,1,3,0,0,0,0,1.288061,4.0


In [38]:
std_performance_n[std_performance_n['GradeClass']<3.0].head()

StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
1001,17,1,0,2,19.833723,7,1,2,0,0,1,0,2.929196,2.0
1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0
1006,18,0,0,1,8.191219,0,0,1,1,0,0,0,3.084184,1.0
1007,15,0,1,1,15.60168,10,0,3,0,1,0,0,2.748237,2.0
1009,17,0,0,0,4.562008,1,0,2,0,1,0,1,2.896819,2.0


In [39]:
std_performance_n[std_performance_n['Gender']==0].head()

StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0
1003,15,0,2,3,4.21057,26,0,2,0,0,0,0,0.112602,4.0
1006,18,0,0,1,8.191219,0,0,1,1,0,0,0,3.084184,1.0
1007,15,0,1,1,15.60168,10,0,3,0,1,0,0,2.748237,2.0
1009,17,0,0,0,4.562008,1,0,2,0,1,0,1,2.896819,2.0


In [40]:
std_performance_n[(std_performance_n['Age']<35) & (std_performance_n['Gender']==0)].head()

StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0
1003,15,0,2,3,4.21057,26,0,2,0,0,0,0,0.112602,4.0
1006,18,0,0,1,8.191219,0,0,1,1,0,0,0,3.084184,1.0
1007,15,0,1,1,15.60168,10,0,3,0,1,0,0,2.748237,2.0
1009,17,0,0,0,4.562008,1,0,2,0,1,0,1,2.896819,2.0


In [41]:
std_performance_n[(std_performance_n['GPA']>2) & (std_performance_n['StudyTimeWeekly']>10)].head()

StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
1001,17,1,0,2,19.833723,7,1,2,0,0,1,0,2.929196,2.0
1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0
1004,17,1,0,3,10.028829,14,0,3,1,0,0,0,2.054218,3.0
1007,15,0,1,1,15.60168,10,0,3,0,1,0,0,2.748237,2.0
1010,16,1,0,1,18.444466,0,0,3,1,0,0,0,3.573474,0.0
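
The cells above combine conditions with `&` (and). Conditions can also be combined with `|` (or) and negated with `~`. Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Age": [15, 16, 17, 18],
    "Gender": [0, 1, 0, 1],
    "GPA": [3.5, 1.0, 2.5, 3.9],
})

# Rows where either condition holds.
either = toy[(toy["Age"] >= 18) | (toy["GPA"] > 3.0)]

# Rows where the condition does not hold.
gender_not_one = toy[~(toy["Gender"] == 1)]

# between() is a shorthand for an inclusive range check.
mid_age = toy[toy["Age"].between(16, 17)]

print(either)
```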


## **Q. Use the query method to filter rows.**

In [42]:
std_performance_n.query('Age < 18 and GradeClass < 3.0')

StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
1001,17,1,0,2,19.833723,7,1,2,0,0,1,0,2.929196,2.0
1007,15,0,1,1,15.601680,10,0,3,0,1,0,0,2.748237,2.0
1009,17,0,0,0,4.562008,1,0,2,0,1,0,1,2.896819,2.0
1010,16,1,0,1,18.444466,0,0,3,1,0,0,0,3.573474,0.0
1021,16,1,0,3,2.621597,2,0,3,0,0,0,1,2.778411,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3382,15,0,2,0,10.095086,5,0,3,0,0,0,0,2.956255,0.0
3386,16,1,0,1,1.445434,20,0,3,1,1,0,0,1.395631,1.0
3390,16,1,0,2,6.805500,20,0,2,0,0,0,1,1.142333,2.0
3391,16,1,1,0,12.416653,17,0,2,0,1,1,0,1.803297,1.0
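
A `query()` string can refer to Python variables by prefixing them with `@`, which avoids hard-coding thresholds. The sketch below uses made-up data and a hypothetical `max_age` variable, and checks that the result matches the equivalent boolean mask.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Age": [15, 16, 17, 18], "GradeClass": [0.0, 2.0, 4.0, 1.0]})

max_age = 18  # hypothetical threshold held in a Python variable

# '@max_age' is replaced by the value of the local variable.
via_query = toy.query("Age < @max_age and GradeClass < 3.0")

# The same filter written as a boolean mask.
via_mask = toy[(toy["Age"] < max_age) & (toy["GradeClass"] < 3.0)]

print(via_query.equals(via_mask))
```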


## **Q. Use 'isin' to filter rows based on a list of values.**

In [43]:
std_performance_n.loc[std_performance_n['Absences'].isin([5, 17, 20]), ['Tutoring', 'Absences']]

StudentID,Tutoring,Absences
1005,1,17
1016,1,17
1024,0,20
1026,0,5
1030,0,20
...,...,...
3382,0,5
3383,0,20
3386,0,20
3390,0,20
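
Negating `isin()` with `~` keeps the rows whose value is not in the list. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Absences": [5, 3, 17, 20, 8]})
wanted = [5, 17, 20]

inside = toy[toy["Absences"].isin(wanted)]    # value is in the list
outside = toy[~toy["Absences"].isin(wanted)]  # value is not in the list

print(inside)
print(outside)
```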


In [44]:
std_performance_n.rename(columns={'GPA': 'GradePointAverage'}, inplace=True)
std_performance_ncol = std_performance_n[['Age', 'GradeClass', 'GradePointAverage']]
std_performance_ncol.head()

StudentID,Age,GradeClass,GradePointAverage
1001,17,2.0,2.929196
1002,18,1.0,3.042915
1003,15,4.0,0.112602
1004,17,3.0,2.054218
1005,17,4.0,1.288061


In [45]:
std_performance_n['RaceEthnicity'] = std_performance_n['Ethnicity']
std_performance_n.head()

StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GradePointAverage,GradeClass,RaceEthnicity
1001,17,1,0,2,19.833723,7,1,2,0,0,1,0,2.929196,2.0,0
1002,18,0,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0,0
1003,15,0,2,3,4.21057,26,0,2,0,0,0,0,0.112602,4.0,2
1004,17,1,0,3,10.028829,14,0,3,1,0,0,0,2.054218,3.0,0
1005,17,1,0,2,4.672495,17,1,3,0,0,0,0,1.288061,4.0,0


## *The new column 'RaceEthnicity' has been created; now drop the original 'Ethnicity' column*

> The `inplace=True` parameter makes the operation modify the DataFrame `std_performance_n` directly instead of returning a new DataFrame.

In [46]:
std_performance_n.drop(columns=['Ethnicity'], inplace=True)
std_performance_n.head()

StudentID,Age,Gender,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GradePointAverage,GradeClass,RaceEthnicity
1001,17,1,2,19.833723,7,1,2,0,0,1,0,2.929196,2.0,0
1002,18,0,1,15.408756,0,0,1,0,0,0,0,3.042915,1.0,0
1003,15,0,3,4.21057,26,0,2,0,0,0,0,0.112602,4.0,2
1004,17,1,3,10.028829,14,0,3,1,0,0,0,2.054218,3.0,0
1005,17,1,2,4.672495,17,1,3,0,0,0,0,1.288061,4.0,0
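
As an alternative to `inplace=True`, `rename()` returns a new DataFrame by default and leaves the original untouched. It can also rename a column in one step instead of copying and then dropping it, although the renamed column then stays in its original position rather than moving to the end. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"GPA": [2.9, 3.0], "Ethnicity": [0, 2]})

# Without inplace=True, rename() returns a new DataFrame.
cleaned = toy.rename(
    columns={"GPA": "GradePointAverage", "Ethnicity": "RaceEthnicity"}
)

print(list(cleaned.columns))
print(list(toy.columns))  # the original is unchanged
```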


# **The End :)**