# Exercise 1

In this exercise, we will practice `apply()` in `pandas`. `DataFrame.apply()` applies a function along one axis of a DataFrame: with `axis=0` (the default) the function receives each column as a Series, and with `axis=1` it receives each row. `Series.apply()` instead calls the function on each element of a single column, which is how it is used in this exercise. You can pass built-in functions or define your own, which makes `apply()` useful for data transformation, aggregation, and feature creation when no built-in vectorized method fits. However, it is usually slower than a vectorized alternative on large datasets.
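
As a rough illustration of the axis behaviour described above, the sketch below uses a small made-up frame (`scores` and its column names are hypothetical and not part of the exercise data).

```python
import pandas as pd

# Small made-up frame used only for illustration.
scores = pd.DataFrame({"math": [10, 14, 8], "reading": [12, 9, 15]})

# axis=0 (default): the function receives each column as a Series.
column_ranges = scores.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_totals = scores.apply(lambda row: row["math"] + row["reading"], axis=1)

# Series.apply calls the function once per element.
math_doubled = scores["math"].apply(lambda value: value * 2)

# Vectorized equivalent of row_totals, usually faster on large data.
row_totals_vectorized = scores["math"] + scores["reading"]
```

Both `row_totals` and `row_totals_vectorized` hold the same values; the vectorized form simply avoids calling a Python function once per row.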

### Exercise 1(a) (2 points)

Load the `pandas` library.

In [1]:
import pandas as pd

### Exercise 1(b) (2 points)

Read the `student_alcohol_consumption.csv` data file and create a data frame called `student_df`.

In [2]:
student_df = pd.read_csv('student_alcohol_consumption.csv')
student_df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


### Exercise 1(c) (2 points)

From the `student_df` only keep the first 11 columns.

In [4]:
student_df = student_df.iloc[:,0:11]
student_df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course
1,GP,F,17,U,GT3,T,1,1,at_home,other,course
2,GP,F,15,U,LE3,T,1,1,at_home,other,other
3,GP,F,15,U,GT3,T,4,2,health,services,home
4,GP,F,16,U,GT3,T,3,3,other,other,home


### Exercise 1(d) (2 points)

Lambda functions in Python are small, anonymous functions defined with the `lambda` keyword. They can take any number of input arguments but contain only a single expression, which is evaluated and returned. Lambda functions are often used for short, simple operations where defining a full function with `def` would be excessive, such as when a function is needed only temporarily, for example inside `map()`, `filter()`, or `apply()` in pandas. While convenient for concise logic, they are harder to read and more limited than regular functions: the body cannot contain statements such as `return`, `for`, or `while`.
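
A minimal sketch of these points, using made-up names (`square`, `numbers`, `add`) that are not part of the exercise:

```python
# A lambda and its def equivalent behave the same way.
square = lambda x: x ** 2

def square_def(x):
    return x ** 2

# Lambdas are typically passed inline to functions that expect a callable.
numbers = [3, 1, 4, 1, 5]
doubled = list(map(lambda n: n * 2, numbers))
odd_only = list(filter(lambda n: n % 2 == 1, numbers))

# A lambda can take several arguments but still holds a single expression.
add = lambda a, b: a + b
```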

Create a lambda function that converts strings to uppercase.

In [5]:
capitalizer = lambda x: x.upper()

### Exercise 1(e) (2 points)

Convert both `Mjob` and `Fjob` to uppercase.

In [6]:
student_df['Mjob'] = student_df['Mjob'].apply(capitalizer)
student_df['Fjob'] = student_df['Fjob'].apply(capitalizer)
student_df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason
0,GP,F,18,U,GT3,A,4,4,AT_HOME,TEACHER,course
1,GP,F,17,U,GT3,T,1,1,AT_HOME,OTHER,course
2,GP,F,15,U,LE3,T,1,1,AT_HOME,OTHER,other
3,GP,F,15,U,GT3,T,4,2,HEALTH,SERVICES,home
4,GP,F,16,U,GT3,T,3,3,OTHER,OTHER,home
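
For comparison, pandas also offers a vectorized string method, `Series.str.upper()`, that gives the same result without `apply()`. The sketch below uses a made-up Series (`jobs`) with one missing entry to show a practical difference; it is an assumption that a column could contain missing values, and nothing here says the exercise data does.

```python
import pandas as pd

# Hypothetical job values; the None stands in for a missing entry.
jobs = pd.Series(["at_home", "teacher", None])

# The .str accessor uppercases each string and leaves missing values missing.
upper_jobs = jobs.str.upper()

# An element-wise lambda has no such guard and fails on the missing entry.
try:
    jobs.apply(lambda x: x.upper())
    lambda_failed = False
except AttributeError:
    lambda_failed = True
```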


### Exercise 1(f) (4 points)

Create a function called `majority` that returns a boolean value, and use it to create a new column called `legal_drinker` (treat anyone older than 17 as having reached the age of majority).

In [7]:
def majority(x):
    # True when the age is above 17 (that is, 18 or older)
    return x > 17

student_df['legal_drinker'] = student_df['age'].apply(majority)
student_df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,legal_drinker
0,GP,F,18,U,GT3,A,4,4,AT_HOME,TEACHER,course,True
1,GP,F,17,U,GT3,T,1,1,AT_HOME,OTHER,course,False
2,GP,F,15,U,LE3,T,1,1,AT_HOME,OTHER,other,False
3,GP,F,15,U,GT3,T,4,2,HEALTH,SERVICES,home,False
4,GP,F,16,U,GT3,T,3,3,OTHER,OTHER,home,False
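
Because the rule is a single comparison, the same column can also be built without `apply()`. The sketch below uses a hypothetical frame named `ages`, filled with the five ages shown in the output above.

```python
import pandas as pd

# Ages taken from the five rows displayed above; the frame name is made up.
ages = pd.DataFrame({"age": [18, 17, 15, 15, 16]})

# Comparing a whole column at once yields a boolean Series.
ages["legal_drinker"] = ages["age"] > 17
```

This is the vectorized alternative mentioned in the introduction to Exercise 1; `apply()` remains the better fit when the rule cannot be written as a column-wide expression.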


# Exercise 2 

In this exercise, we will work through an object-oriented programming exercise built around standard operations on `pandas` data frames.

### Exercise 2(a) (10 points)

Create a class called `DataFrameProcessor` that will handle basic operations on a pandas DataFrame. Your class should:

1. Initialize with a pandas DataFrame.
2. Have a method `get_column_stats(column_name)` that returns the mean, median, and standard deviation of the specified column.
3. Have a method `filter_rows(column_name, threshold)` that returns a new DataFrame with rows where the values in the specified column are greater than the given threshold.
4. Have a method `add_new_column(column_name, data)` that adds a new column with the specified name and data to the DataFrame.


In [8]:
class DataFrameProcessor:
    """Basic operations on a pandas DataFrame."""

    def __init__(self, df):
        # Store a reference to the DataFrame (not a copy)
        self.df = df

    def get_column_stats(self, column_name):
        # Return the mean, median, and sample standard deviation of a column
        column = self.df[column_name]
        return {'mean': column.mean(),
                'median': column.median(),
                'std': column.std()}

    def filter_rows(self, column_name, threshold):
        # Return a new DataFrame with rows where the column value exceeds the threshold
        return self.df[self.df[column_name] > threshold]

    def add_new_column(self, column_name, data):
        # Add a new column to the stored DataFrame in place and return it
        self.df[column_name] = data
        return self.df

### Exercise 2(b) (3 points)

Create a data frame called `df` with three columns and 20 rows. Fill the data frame with random integers from 1 to 20. Hint: use `np.random.randint` from `numpy` to generate the random integers, and note that its upper bound is exclusive.

In [10]:
import numpy as np

# high is exclusive, so 21 is needed for 20 to be a possible value
df = pd.DataFrame({'col_1': np.random.randint(1, 21, 20),
                   'col_2': np.random.randint(1, 21, 20),
                   'col_3': np.random.randint(1, 21, 20)})
df

Unnamed: 0,col_1,col_2,col_3
0,6,16,1
1,7,10,15
2,9,9,3
3,10,2,18
4,19,15,18
5,15,11,12
6,4,12,1
7,17,7,9
8,10,14,11
9,15,15,16
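
One detail worth checking when generating the data: `np.random.randint(low, high, size)` draws from the half-open interval `[low, high)`, so `high` itself is never returned. The sketch below illustrates the bounds; the variable names and the seed are arbitrary choices.

```python
import numpy as np

# randint draws from the half-open interval [low, high): high never appears.
up_to_19 = np.random.randint(1, 20, size=10_000)
up_to_20 = np.random.randint(1, 21, size=10_000)

# The newer Generator API can include the upper bound explicitly.
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility
inclusive = rng.integers(1, 20, size=10_000, endpoint=True)
```

To cover 1 to 20 inclusive, pass `high=21` to `randint`, or use `Generator.integers` with `endpoint=True`.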


### Exercise 2(c) (4 points)

Using the `DataFrameProcessor` class, report the stats for each of the columns in `df`.

In [11]:
# create instance of class
df_processor = DataFrameProcessor(df)

# get mean, median, and std for each column
print('col 1 stats are:', df_processor.get_column_stats('col_1'))
print('col 2 stats are:', df_processor.get_column_stats('col_2'))
print('col 3 stats are:', df_processor.get_column_stats('col_3'))

col 1 stats are: {'mean': 10.3, 'median': 9.5, 'std': 5.620451377265934}
col 2 stats are: {'mean': 9.7, 'median': 10.5, 'std': 5.478186407543054}
col 3 stats are: {'mean': 10.1, 'median': 11.0, 'std': 5.90182844103151}
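
As a cross-check on output like the above, the same three statistics can be computed for every column at once with `DataFrame.agg`. The sketch below uses a small made-up frame (`demo`) rather than the random data, so the values are predictable.

```python
import pandas as pd

# Made-up data with easily verified statistics.
demo = pd.DataFrame({"col_1": [1, 2, 3, 4], "col_2": [2, 4, 6, 8]})

# One row per statistic, one column per original column.
summary = demo.agg(["mean", "median", "std"])
```

Note that pandas `std()` defaults to the sample standard deviation (`ddof=1`), whereas `np.std` defaults to the population version (`ddof=0`), so the two can disagree on the same data.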


### Exercise 2(d) (3 points)

Filter the rows where `col_3` is greater than 12.

In [12]:
df_processor.filter_rows('col_3', 12)

Unnamed: 0,col_1,col_2,col_3
1,7,10,15
3,10,2,18
4,19,15,18
9,15,15,16
11,17,4,18
16,4,3,13
18,15,18,17
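
The filtered result keeps the original row labels (1, 3, 4, ... above) and leaves the source DataFrame untouched. A minimal sketch of that behaviour, using a made-up frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"col_3": [1, 15, 3, 18]})

# Boolean indexing returns a new DataFrame; demo itself is not modified.
filtered = demo[demo["col_3"] > 12]

# Row labels carry over from the original; reset them if a 0..n-1 index is wanted.
renumbered = filtered.reset_index(drop=True)
```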


### Exercise 2(e) (3 points)

Create a list of 20 random integers from 1 to 20. Add the list to `df` as a new column called `col_4`.

In [13]:
new_data = np.random.randint(1, 21, 20).tolist()  # high is exclusive, so 21 includes 20

df_processor.add_new_column('col_4', new_data)


Unnamed: 0,col_1,col_2,col_3,col_4
0,6,16,1,2
1,7,10,15,16
2,9,9,3,1
3,10,2,18,2
4,19,15,18,16
5,15,11,12,1
6,4,12,1,17
7,17,7,9,3
8,10,14,11,13
9,15,15,16,10
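
Two behaviours of column assignment are worth keeping in mind here. Because the class stores a reference to the DataFrame rather than a copy, adding a column through it also changes the original `df`; and the new data must match the number of rows. The sketch below shows both with made-up names (`original`, `alias`, `independent`).

```python
import pandas as pd

original = pd.DataFrame({"col_1": [1, 2, 3]})

# Assigning the same object to another name does not copy it.
alias = original
alias["col_4"] = [7, 8, 9]  # original now has col_4 as well

# An explicit copy keeps later changes separate from the original.
independent = original.copy()
independent["col_5"] = [0, 0, 0]

# A list whose length differs from the number of rows is rejected.
try:
    original["too_short"] = [1, 2]
    length_error = False
except ValueError:
    length_error = True
```

If the caller's DataFrame should stay unchanged, one option is to store `df.copy()` in `__init__`; whether that is desirable depends on how the class is meant to be used.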
