# HW01
Welcome to Assignment 1 of Introduction to Data Analysis.

In this assignment, you will practice the basics of Python, NumPy, and pandas.

Please follow the instructions below:

1. Write your code between <br>
&nbsp;&nbsp;&nbsp;&nbsp; **\# BEGIN_YOUR_CODE**<br>
&nbsp;&nbsp;&nbsp;&nbsp; and<br>
&nbsp;&nbsp;&nbsp;&nbsp; **\# END_YOUR_CODE**.

2. Do not use **external libraries** (i.e., do not add any `import` statements of your own to your code). <br>
   Your code will fail to execute and receive a **score of 0** if you use them.

3. Rename this file to **[student_id].ipynb** (e.g., 20230000.ipynb) and submit it to PLMS. <br>
   A **30% penalty** applies if you do not follow the submission format.

4. Late submissions are not accepted. <br>
   You will receive **no score** for a late submission.

In [1]:
import numpy as np
import pandas as pd

## Problem 1. Factorial [2 points]
Given `n`, implement the function `factorial(n)` that computes `n!`. <br>
Assume `n` is a non-negative integer. Note that `0!` is defined as `1`.


In [2]:
def factorial(n):
    # BEGIN_YOUR_CODE
    if n == 0 or n == 1:
        return 1
    else:
        return n * factorial(n-1)
    # END_YOUR_CODE
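
The recursive solution works for small `n`, but each call adds a stack frame, so very large inputs can hit Python's recursion limit (1000 by default in CPython). For comparison, a minimal iterative sketch follows; the name `factorial_iter` is hypothetical and not part of the assignment.

```python
def factorial_iter(n):
    """Iterative n! for a non-negative integer n (hypothetical helper name)."""
    result = 1
    # Multiply 2 * 3 * ... * n; the loop body never runs for n = 0 or n = 1.
    for k in range(2, n + 1):
        result *= k
    return result


print(factorial_iter(0))  # 1
print(factorial_iter(5))  # 120
```

Both versions return the same values; the loop simply avoids the nested calls.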

## Problem 2. Frequent Word Count [2 points]
Implement a function `freq_word_count(filename)` that takes a filename as input. <br>
This function should open the specified file and determine which word appears most frequently. It should return a `tuple` whose first item is the most frequent word and whose second item is that word's count.<br>
If several words share the maximum frequency, you may return any one of them.


In [3]:
def freq_word_count(filename):
    '''
    filename: .txt file
    '''
    # BEGIN_YOUR_CODE
    with open(filename, 'r') as f:
        flines = f.readlines()

    word_cnt = {}
    for line in flines:
        words = line.strip().split()
        
        for word in words:
            if word in word_cnt:
                word_cnt[word] += 1
            else:
                word_cnt[word] = 1

    most_freq_word = max(word_cnt, key=word_cnt.get)
    most_freq_cnt = word_cnt[most_freq_word]

    return (most_freq_word, most_freq_cnt)
    # END_YOUR_CODE
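
The counting step can be tried without a file. The sketch below applies the same dictionary-and-`max` idea to a list of words; `most_frequent` is a hypothetical helper name and the sentence is made up for illustration.

```python
def most_frequent(words):
    """Return (word, count) for the most common item in a non-empty list.

    `most_frequent` is a hypothetical helper used only for illustration.
    """
    counts = {}
    for word in words:
        # dict.get supplies 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    # max iterates over the keys and compares them by their counts.
    top = max(counts, key=counts.get)
    return (top, counts[top])


text = "the cat saw the dog and the bird"
print(most_frequent(text.split()))  # ('the', 3)
```

Because `split()` separates on whitespace only, tokens such as `The`, `the`, and `the.` are counted as different words.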

## Problem 3. Median Score [3 points]
Given `students`, a list of `Student` instances, implement the function `median_score(students)` that returns a dictionary mapping each `student_id` to that student's median score.<br> If the number of scores for a student is odd, the median is the middle score. If the number of scores is even, the median is the average of the two middle scores.

In [4]:
class Student:
    def __init__(self, args):
        name, student_id, grade =  args
        self.name = name
        self.student_id = student_id
        self.grade = grade
    
def median_score(students):
    '''
    Args:
    students (list): list of Student instances

    Returns:
    result (dict): A dictionary where each key is a student_id and the value is the median score.
    '''
    # BEGIN_YOUR_CODE
    result = {}
    
    for student in students:
        sorted_grades = sorted(student.grade)
        n = len(sorted_grades)

        if n % 2 == 1:
            median = sorted_grades[n//2]
        else:
            median = (sorted_grades[n//2 - 1] + sorted_grades[n//2]) / 2

        result[student.student_id] = median
        
    # END_YOUR_CODE    
    return result
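
To make the input format and the odd/even rule concrete, here is a small sketch. It assumes `grade` holds a non-empty list of numeric scores; the students, IDs, and the helper name `median_of` are made up for illustration.

```python
class Student:
    # Same constructor shape as the class above: one tuple of three fields.
    def __init__(self, args):
        name, student_id, grade = args
        self.name = name
        self.student_id = student_id
        self.grade = grade


def median_of(scores):
    """Median of a non-empty list of numbers (hypothetical helper name)."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element.
        return ordered[mid]
    # Even count: average of the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Made-up students, for illustration only.
students = [
    Student(("Alice", 20230001, [90, 70, 80])),
    Student(("Bob", 20230002, [70, 100, 80, 90])),
]
medians = {s.student_id: median_of(s.grade) for s in students}
print(medians)  # {20230001: 80, 20230002: 85.0}
```

Note that an odd count returns the score itself, while an even count returns a float because of the division.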

## Problem 4. Vector Norm [2 points]
Given two NumPy arrays `arr1` and `arr2` and an integer `n`, implement a function `vector_norm` that computes the vector norm of order `n` (the n-norm) of the difference between `arr1` and `arr2`. <br>
Use the NumPy library for this problem.

In [5]:
def vector_norm(arr1, arr2, n):
    # BEGIN_YOUR_CODE
    diff = arr1 - arr2
    norm = np.linalg.norm(diff, ord=n)
    return norm
    # END_YOUR_CODE
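
For a 1-D difference vector `d`, the n-norm is `(sum(|d_i| ** n)) ** (1 / n)`. The sketch below checks that formula against `np.linalg.norm` on made-up inputs, assuming 1-D arrays and `n >= 1`.

```python
import numpy as np

# Made-up vectors; their difference is [3, 4].
arr1 = np.array([3.0, 4.0])
arr2 = np.array([0.0, 0.0])
diff = arr1 - arr2

for n in (1, 2, 3):
    # Direct evaluation of the n-norm definition.
    manual = np.sum(np.abs(diff) ** n) ** (1.0 / n)
    library = np.linalg.norm(diff, ord=n)
    assert np.isclose(manual, library)

print(np.linalg.norm(diff, ord=1))  # 7.0
print(np.linalg.norm(diff, ord=2))  # 5.0
```

For 2-D inputs, `np.linalg.norm` interprets `ord` as a matrix norm, so the equivalence above holds only for vectors.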

## Problem 5. CSV Modification [5 points]
Your goal is to modify the given CSV file according to the constraints below. <br>
The inputs are the paths of the original data and the modified data. <br>
Use the pandas library for this problem.

### Constraints
- Apply the requirements in the order given below.<br>
  (Otherwise, you may get different results even if each step is implemented correctly.)
1. The modified CSV file should contain only rows where "Active" cases are greater than 10,000.
2. The modified CSV file should contain only rows in the `Europe` region.
3. Add new columns called "Mortality_Rate" and "Recovery_Rate", calculated as `(Deaths / Confirmed) * 100` and `(Recovered / Confirmed) * 100`, respectively.
4. Sort the data by "Mortality_Rate" in descending order.


In [6]:
import pandas as pd

def covid(original_file, modified_file):
    df = pd.read_csv(original_file)
    
    # BEGIN_YOUR_CODE
    # Contain rows where "Active" cases are greater than 10,000
    df = df[df['Active'] > 10000]
    # Only have "Europe" region; copy so the new columns go into an independent frame
    df = df[df['WHO Region'] == 'Europe'].copy()

    # New columns "Mortality_Rate" and "Recovery_Rate"
    df['Mortality_Rate'] = (df['Deaths'] / df['Confirmed']) * 100
    df['Recovery_Rate'] = (df['Recovered'] / df['Confirmed']) * 100

    # Sort by "Mortality_Rate" in descending order
    df = df.sort_values(by='Mortality_Rate', ascending=False)
    # END_YOUR_CODE
    df.to_csv(modified_file, index=False)
    return df
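
The four steps can be traced on a tiny in-memory table. In the sketch below, only the column names `Active`, `WHO Region`, `Deaths`, `Confirmed`, and `Recovered` come from the task; the `Country` column and every number are made up for illustration.

```python
import pandas as pd

# Toy rows with made-up values; "Country" is a hypothetical label column.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Confirmed": [100000, 50000, 200000, 80000],
    "Deaths": [2000, 2500, 1000, 4000],
    "Recovered": [50000, 30000, 100000, 70000],
    "Active": [48000, 17500, 99000, 6000],
    "WHO Region": ["Europe", "Europe", "Americas", "Europe"],
})

# Step 1: keep rows with more than 10,000 active cases (drops D).
df = df[df["Active"] > 10000]
# Step 2: keep the Europe region only (drops C); copy before adding columns.
df = df[df["WHO Region"] == "Europe"].copy()
# Step 3: rates as percentages of confirmed cases.
df["Mortality_Rate"] = df["Deaths"] / df["Confirmed"] * 100
df["Recovery_Rate"] = df["Recovered"] / df["Confirmed"] * 100
# Step 4: highest mortality rate first.
df = df.sort_values(by="Mortality_Rate", ascending=False)

print(df[["Country", "Mortality_Rate", "Recovery_Rate"]])
```

Country B (about 5% mortality) ends up ahead of country A (about 2%), and rows C and D are filtered out.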

## Problem 6. Employee and Department [6 points]
For this problem, three CSV files, `departments.csv`, `employees.csv`, and `employees2.csv`, are given. <br>
There are two subproblems. <br>
Use the pandas library for this problem.

### 6.a Employee Table [3 points]
Build an employee table with `name`, `salary`, and `department_name` as columns. <br>
Note that each department has its own `department_id` and `department_name`.

In [7]:
def emp_table(dep, emp1, emp2):
    # BEGIN_YOUR_CODE
    dep_df = pd.read_csv(dep)
    emp1_df = pd.read_csv(emp1)
    emp2_df = pd.read_csv(emp2)

    emp_df = pd.concat([emp1_df, emp2_df])
    df = emp_df.merge(dep_df, on='department_id', how='left')
    df = df[['name', 'salary', 'department_name']]
    # END_YOUR_CODE
    return df
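
The concatenate-then-merge pattern can be sketched on in-memory frames. The example assumes both employee files share the columns `name`, `salary`, and `department_id`; the actual CSV schemas are not shown here, and all rows are made up.

```python
import pandas as pd

# Made-up stand-ins for the three CSV files.
dep_df = pd.DataFrame({
    "department_id": [1, 2],
    "department_name": ["Sales", "R&D"],
})
emp1_df = pd.DataFrame({
    "name": ["Ann", "Bo"],
    "salary": [5000, 6000],
    "department_id": [1, 2],
})
emp2_df = pd.DataFrame({
    "name": ["Cy"],
    "salary": [7000],
    "department_id": [2],
})

# Stack the two employee tables; ignore_index gives a clean 0..n-1 index.
emp_df = pd.concat([emp1_df, emp2_df], ignore_index=True)
# A left merge keeps every employee and looks up the department name by id.
table = emp_df.merge(dep_df, on="department_id", how="left")
table = table[["name", "salary", "department_name"]]
print(table)
```

With `how="left"`, an employee whose `department_id` has no match would still appear, with `NaN` as the department name.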

### 6.b Highest Average Salary [3 points]
Find the department that has the highest average salary.<br>
The output should be a dictionary with the `department_name` as the key and that department's average salary as the value. <br>
You may reuse the `emp_table` function from 6.a.

In [8]:
def highest_avg_salary(dep, emp1, emp2):
    # BEGIN_YOUR_CODE
    _emp_table = emp_table(dep, emp1, emp2)
    avgsal_by_dep = _emp_table.groupby('department_name')['salary'].mean()

    highest_avgsal_dep = avgsal_by_dep.idxmax()
    # Named differently from the function to avoid shadowing it
    highest_avgsal = avgsal_by_dep.max()
    
    return {highest_avgsal_dep: highest_avgsal}
    # END_YOUR_CODE
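
The group-and-select step can be checked on a small made-up table. The sketch below assumes a frame with the same three columns as the 6.a output; the `float(...)` conversion is an optional addition that turns the NumPy scalar into a plain Python float.

```python
import pandas as pd

# Made-up employee table with the same columns as the 6.a output.
table = pd.DataFrame({
    "name": ["Ann", "Bo", "Cy"],
    "salary": [5000, 6000, 7000],
    "department_name": ["Sales", "R&D", "R&D"],
})

# Mean salary per department: R&D -> 6500.0, Sales -> 5000.0.
avg = table.groupby("department_name")["salary"].mean()
# idxmax returns the index label (department name) of the largest mean.
best = avg.idxmax()
result = {best: float(avg.max())}
print(result)  # {'R&D': 6500.0}
```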