# Assignment 4

### Instructions:

Assignment 4 covers the NumPy and Pandas packages. Its goal is to make sure you are comfortable using both. Follow the steps below to receive a passing grade:

    - Complete the following tasks within this notebook
    - When finished, convert this notebook to an HTML/PDF file
    - Place the following items into a zip folder:
        - assignment.HTML or assignment.PDF
        - assignment.ipynb
        - assignment.csv
    - Name this folder using the naming convention: FIRSTNAME_LASTNAME_ASSIGNMENT4.zip
        - For example, if your name was Jane Doe, then the zipped folder should be called JANE_DOE_ASSIGNMENT4.zip
 

In [18]:
# import packages
import numpy as np
import pandas as pd

## 1. Manipulate an Array

Without using the `np.array()` function, manipulate (in any way you choose) a $2 \times 6$ array of zeros, to match the final $4 \times 3$ array shown below.

$$start \rightarrow \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \rightarrow \text{manipulate} \rightarrow
\begin{bmatrix}
12 & 11 & 10\\
9 & 8 & 7\\
10 & 10 & 10\\
3 & 2 & 1\\
\end{bmatrix}$$

In [19]:
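# Start from the required 2x6 array of zeros and reshape it to 4x3
target = np.zeros(shape=(2, 6), dtype=int).reshape(4, 3)

# Values for the final 4x3 matrix, listed row by row
values = [12, 11, 10, 9, 8, 7, 10, 10, 10, 3, 2, 1]

# Overwrite the zeros in place with the target values
target[:] = np.reshape(values, (4, 3))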

print(target)

[[12 11 10]
 [ 9  8  7]
 [10 10 10]
 [ 3  2  1]]
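
The target matrix is nearly a countdown from 12 to 1, with only the third row differing. One possible way to exploit that pattern, sketched here with the illustrative name `grid`, is to fill the reshaped zeros from `np.arange()` and then overwrite a single row:

```python
import numpy as np

# Start from a 2x6 array of zeros and reshape it to 4x3 (same 12 elements)
grid = np.zeros((2, 6), dtype=int).reshape(4, 3)

# Fill every element with a countdown from 12 to 1, row by row
grid[:] = np.arange(12, 0, -1).reshape(4, 3)

# The third row should be all tens, so overwrite it by broadcasting a scalar
grid[2] = 10

print(grid)
```

This avoids typing out all twelve values, at the cost of being tied to this particular pattern.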


## 2. Create a DataFrame and Clean

**A. Create a `DataFrame` using the `pd.DataFrame()` function that is identical to the table below. Remember to use `np.nan` to signify missing values.**

| name | gender | weight_kg | height_cm |
| --- | --- | --- | --- |
| Jacinda | F | 55.51 | **NaN** |
| Oralla | M | 60.4 | 162.79 |
| Aeriell | F | **NaN** | 162.32 |
| Portie | X | 73.94 | 160.33 |
| Ferdinanda | F | 60.62 | 146.68 |


In [20]:
df = pd.DataFrame({
    'name': ['Jacinda', 'Oralla', 'Aeriell', 'Portie', 'Ferdinanda'],
    'gender': ['F', 'M', 'F', 'X', 'F'],
    'weight_kg': ['55.51', '60.4', np.nan, '73.94', '60.62'],
    'height_cm': [np.nan, '162.79', '162.32', '160.33', '146.68']
})

print(df)

         name gender weight_kg height_cm
0     Jacinda      F     55.51       NaN
1      Oralla      M      60.4    162.79
2     Aeriell      F       NaN    162.32
3      Portie      X     73.94    160.33
4  Ferdinanda      F     60.62    146.68
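
The weights and heights above are stored as strings, so any arithmetic on them needs a conversion first. As a hedged alternative, the sketch below passes floats directly so that pandas infers numeric columns; the name `people` is an illustrative stand-in and not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the table with numeric literals instead of strings
people = pd.DataFrame({
    'name': ['Jacinda', 'Oralla', 'Aeriell', 'Portie', 'Ferdinanda'],
    'gender': ['F', 'M', 'F', 'X', 'F'],
    'weight_kg': [55.51, 60.4, np.nan, 73.94, 60.62],
    'height_cm': [np.nan, 162.79, 162.32, 160.33, 146.68],
})

# Numeric columns that contain NaN are stored as float64
print(people.dtypes)

# Count the missing values in each column
print(people.isna().sum())
```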


**B. Replace the missing value in the `weight_kg` column with the mean weight across the `weight_kg` column**

In [21]:
# Convert the non-missing values into a numeric type
weights = pd.to_numeric(df['weight_kg'].dropna())

# Mean of the available weights, rounded to 2 decimal places
mean = round(weights.mean(), 2)

# Fill the missing value in the "weight_kg" column with the mean
new_weight = df['weight_kg'].fillna(mean)

print(new_weight)

0    55.51
1     60.4
2    62.62
3    73.94
4    60.62
Name: weight_kg, dtype: object
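
Because `Series.mean()` skips `NaN` by default, the same fill can also be done on a fully numeric column. A minimal sketch, assuming the column is converted with `pd.to_numeric()` first; the names `raw_weights`, `numeric_weights`, and `filled_weights` are illustrative:

```python
import numpy as np
import pandas as pd

# Stand-in for the string-valued weight_kg column above
raw_weights = pd.Series(['55.51', '60.4', np.nan, '73.94', '60.62'], name='weight_kg')

# Convert to floats; the missing entry stays NaN
numeric_weights = pd.to_numeric(raw_weights)

# mean() ignores NaN by default (skipna=True), so no manual slicing is needed
mean_weight = round(numeric_weights.mean(), 2)

# The filled column keeps a numeric dtype instead of mixing strings and floats
filled_weights = numeric_weights.fillna(mean_weight)
print(filled_weights)
```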


**C. Replace the missing value in the `height_cm` column with the maximum of the heights across the `height_cm` column**

In [22]:
# Keep only the non-missing heights
heightData = df['height_cm'].dropna()

# Find the tallest entry by comparing the heights as numbers, since
# comparing the raw strings would go character by character
max = heightData.loc[pd.to_numeric(heightData).idxmax()]

# Fill in the NaN with max
newHeight = df['height_cm'].fillna(max)

print(newHeight)

0    162.79
1    162.79
2    162.32
3    160.33
4    146.68
Name: height_cm, dtype: object
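
Taking the maximum of a string column compares text, not numbers. The sketch below uses hypothetical heights (one of them below 100 cm) to show how the two can disagree, and how converting with `pd.to_numeric()` first gives the numeric maximum:

```python
import numpy as np
import pandas as pd

# Hypothetical heights stored as strings, including one below 100 cm
heights_text = pd.Series(['99.5', '162.79', np.nan, '146.68'])

# String comparison goes character by character, so '99.5' ranks above '162.79'
text_max = heights_text.dropna().max()

# Converting first gives the numeric maximum
heights_num = pd.to_numeric(heights_text)
tallest = heights_num.max()

# Fill the missing height with the numeric maximum
filled_heights = heights_num.fillna(tallest)
print(text_max, tallest)
```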


**D. Replace the `X` value in the `gender` column with the mode of the `gender` column**

In [23]:
# Search for the mode in the series "gender"
popGender = df['gender'].mode().iloc[0]

# Replace "X" with the mode ("F")
newGender = df['gender'].replace("X", popGender)

print(newGender)

0    F
1    M
2    F
3    F
4    F
Name: gender, dtype: object
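
`Series.mode()` returns a Series rather than a single value, because several values can tie for most frequent. A small sketch with a hypothetical tied column shows the behavior and one way to pick a single replacement value:

```python
import pandas as pd

# Hypothetical column with a two-way tie between 'F' and 'M'
tied = pd.Series(['F', 'M', 'F', 'M', 'X'])

# mode() returns every most-frequent value, sorted, as a Series
modes = tied.mode()
print(modes.tolist())

# Taking the first entry gives one deterministic replacement value
cleaned = tied.replace('X', modes.iloc[0])
print(cleaned.tolist())
```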


**E. Print the cleaned DataFrame**

In [24]:
# Write the cleaned columns from parts B-D back into the DataFrame
df['gender'] = newGender
df['weight_kg'] = new_weight
df['height_cm'] = newHeight

print(df)

         name gender weight_kg height_cm
0     Jacinda      F     55.51    162.79
1      Oralla      M      60.4    162.79
2     Aeriell      F     62.62    162.32
3      Portie      F     73.94    160.33
4  Ferdinanda      F     60.62    146.68


## 3. DataFrame Manipulation

Let's load in the `exercise` dataset from the `seaborn` package.

In [25]:
import seaborn as sns
exercise = sns.load_dataset("exercise") # this requires an internet connection
print(exercise)

    Unnamed: 0  id     diet  pulse    time     kind
0            0   1  low fat     85   1 min     rest
1            1   1  low fat     85  15 min     rest
2            2   1  low fat     88  30 min     rest
3            3   2  low fat     90   1 min     rest
4            4   2  low fat     92  15 min     rest
..         ...  ..      ...    ...     ...      ...
85          85  29   no fat    135  15 min  running
86          86  29   no fat    130  30 min  running
87          87  30   no fat     99   1 min  running
88          88  30   no fat    111  15 min  running
89          89  30   no fat    150  30 min  running

[90 rows x 6 columns]


### Perform the following tasks on the DataFrame:

**A. Remove the unnamed column from the DataFrame**

In [26]:
cleanExercise = exercise.drop(columns=["Unnamed: 0"])

print(cleanExercise)

    id     diet  pulse    time     kind
0    1  low fat     85   1 min     rest
1    1  low fat     85  15 min     rest
2    1  low fat     88  30 min     rest
3    2  low fat     90   1 min     rest
4    2  low fat     92  15 min     rest
..  ..      ...    ...     ...      ...
85  29   no fat    135  15 min  running
86  29   no fat    130  30 min  running
87  30   no fat     99   1 min  running
88  30   no fat    111  15 min  running
89  30   no fat    150  30 min  running

[90 rows x 5 columns]


**B. The `time` column is a string type. Use the `.str.split()` function to extract the numbers out of this column. Save these numbers as a new column called `minutes`. What is the data type of this new column? Should it be transformed to a different data type? If so, perform the transformation.**


In [35]:
# Split "15 min" into ["15", "min"] and keep the first piece
minutes = cleanExercise.time.str.split(" ").str[0]

# The pieces are strings, so convert them to integers before storing them
exercise["minutes"] = minutes.astype(int)

print(exercise)


    Unnamed: 0  id     diet  pulse    time     kind  minutes
0            0   1  low fat     85   1 min     rest        1
1            1   1  low fat     85  15 min     rest       15
2            2   1  low fat     88  30 min     rest       30
3            3   2  low fat     90   1 min     rest        1
4            4   2  low fat     92  15 min     rest       15
..         ...  ..      ...    ...     ...      ...      ...
85          85  29   no fat    135  15 min  running       15
86          86  29   no fat    130  30 min  running       30
87          87  30   no fat     99   1 min  running        1
88          88  30   no fat    111  15 min  running       15
89          89  30   no fat    150  30 min  running       30

[90 rows x 7 columns]


*After the split, the `minutes` values had the data type `object` (strings). Minutes are whole numbers, so the column was converted to `int` with `.astype(int)`.*
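
The split-then-convert step can be checked on a small stand-in Series. This is a sketch with hypothetical names (`time_text`, `minutes_text`, `minutes_int`), not the notebook's own variables:

```python
import pandas as pd

# Stand-in for the `time` column of the exercise dataset
time_text = pd.Series(['1 min', '15 min', '30 min'])

# .str.split() yields lists; .str[0] picks the first element of each list
minutes_text = time_text.str.split(' ').str[0]
print(minutes_text.tolist())   # the pieces are still strings at this point

# Convert the digit strings to integers so comparisons and sums behave numerically
minutes_int = minutes_text.astype(int)
print(minutes_int.tolist())
```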

**C. Since we have the `minutes` column, drop the original `time` column from the DataFrame**

In [37]:
noTime = exercise.drop(columns=["time"])

print(noTime)

    Unnamed: 0  id     diet  pulse     kind  minutes
0            0   1  low fat     85     rest        1
1            1   1  low fat     85     rest       15
2            2   1  low fat     88     rest       30
3            3   2  low fat     90     rest        1
4            4   2  low fat     92     rest       15
..         ...  ..      ...    ...      ...      ...
85          85  29   no fat    135  running       15
86          86  29   no fat    130  running       30
87          87  30   no fat     99  running        1
88          88  30   no fat    111  running       15
89          89  30   no fat    150  running       30

[90 rows x 6 columns]


**D. Subset the DataFrame to only contain entries where the `minutes` column is greater than `10`. Use the `.min()` function to determine the minimum pulse for activities that lasted more than `10` minutes**

In [42]:
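subset = noTime[noTime['minutes'] > 10]

# Minimum pulse among activities that lasted more than 10 minutes
min_pulse = subset['pulse'].min()

print(subset)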

    Unnamed: 0  id     diet  pulse     kind  minutes
1            1   1  low fat     85     rest       15
2            2   1  low fat     88     rest       30
4            4   2  low fat     92     rest       15
5            5   2  low fat     93     rest       30
7            7   3  low fat     97     rest       15
8            8   3  low fat     94     rest       30
10          10   4  low fat     82     rest       15
11          11   4  low fat     83     rest       30
13          13   5  low fat     92     rest       15
14          14   5  low fat     91     rest       30
16          16   6   no fat     83     rest       15
17          17   6   no fat     84     rest       30
19          19   7   no fat     88     rest       15
20          20   7   no fat     90     rest       30
22          22   8   no fat     94     rest       15
23          23   8   no fat     95     rest       30
25          25   9   no fat     99     rest       15
26          26   9   no fat     96     rest   
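
The filter-then-minimum pattern asked for in this part can be sketched on a small made-up sample (these values are hypothetical and are not taken from the exercise dataset):

```python
import pandas as pd

# Small hypothetical sample, not the real exercise data
pulse_sample = pd.DataFrame({
    'pulse': [85, 82, 97, 110, 130],
    'minutes': [1, 15, 30, 15, 30],
})

# A boolean mask keeps only rows where minutes is greater than 10
long_sessions = pulse_sample[pulse_sample['minutes'] > 10]

# Minimum pulse among the remaining rows
lowest_pulse = long_sessions['pulse'].min()
print(lowest_pulse)
```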

**E. Use the `groupby()` function to group the `kind` column. Then aggregate to find the mean pulse for each `kind` of activity in the DataFrame**

In [52]:
grouped = noTime.groupby('kind')
mean_pulse = grouped['pulse'].mean()

print(mean_pulse)

kind
rest        90.833333
walking     95.200000
running    113.066667
Name: pulse, dtype: float64
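
As a rough illustration of the group-then-aggregate step, the sketch below uses a small hypothetical sample and also shows `.agg()`, which can compute several statistics per group at once:

```python
import pandas as pd

# Small hypothetical sample, not the real exercise data
kind_sample = pd.DataFrame({
    'kind': ['rest', 'rest', 'walking', 'walking', 'running', 'running'],
    'pulse': [80, 90, 95, 105, 120, 140],
})

# Group rows by activity, then average the pulse within each group
mean_by_kind = kind_sample.groupby('kind')['pulse'].mean()
print(mean_by_kind)

# .agg() computes several statistics for each group in one call
summary = kind_sample.groupby('kind')['pulse'].agg(['mean', 'min', 'max'])
print(summary)
```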



**F. Save the DataFrame as a CSV file named `assignment.csv`. Make sure that the `index` argument is set to `False`**

In [53]:
noTime.to_csv("assignment.csv", index = False)
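
To see what `index = False` changes, the sketch below writes a small hypothetical frame to an in-memory buffer instead of a file. It also shows that writing the index and reading the file back without `index_col` produces a column named `Unnamed: 0`, like the one removed in part A:

```python
import io
import pandas as pd

# Small hypothetical frame standing in for the cleaned exercise data
frame = pd.DataFrame({'id': [1, 2], 'pulse': [85, 90]})

# Write without the index; the header starts with the first column name
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back reproduces the same columns
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.columns.tolist())

# Writing WITH the index adds an unnamed leading column on the round trip
with_index = pd.read_csv(io.StringIO(frame.to_csv()))
print(with_index.columns.tolist())
```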