#  Applied Machine Learning 

## Homework 1: Programming with Python  



### About this assignment:
The main purpose of this assignment is to check whether your programming knowledge is adequate to take this class. This assignment covers two Python packages, `numpy` and `pandas`, which we'll be using throughout the course. For some of you, Python/numpy/pandas will be familiar; for others, it will be new. Either way, if you find this assignment very difficult, that could be a sign that you will struggle later on in the course. 

Also, as part of this assignment you will likely need to consult the documentation for the Python packages we're using. This is totally OK and, in fact, strongly encouraged. Reading and interpreting documentation is an important skill and is one of the skills this assignment is meant to assess. 

Imports
------

In [10]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Points
------

Each question or sub-question will have a number of points allocated to it, which is indicated right below the question name. 

<br><br>

## Exercise 1: Loading files with Pandas
rubric={points:12}

When working with tabular data, you will typically be creating Pandas dataframes by reading data from .csv files using `pd.read_csv()`. The documentation for this function is available [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In the "data" folder in this homework repository there are six files named `wine_#.csv` or `wine_#.txt`. Look at each of these files and use `pd.read_csv()` to load them so that each resulting dataframe resembles the following:

| Bottle | Grape | Origin | Alcohol | pH | Colour | Aroma |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 |  Chardonnay | Australia | 14.23 | 3.51 | White | Floral |
| 2 |  Pinot Grigio | Italy | 13.20 | 3.30 | White | Fruity |
| 3 |  Pinot Blanc | France | 13.16 | 3.16 | White | Citrus |
| 4 |  Shiraz | Chile | 14.91 | 3.39 | Red | Berry |
| 5 |  Malbec | Argentina | 13.83 | 3.28 | Red | Fruity |
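
Files that hold the same table can still differ in delimiter, leading junk lines, or header placement, so each one may need different `pd.read_csv()` arguments. The sketch below uses made-up in-memory text (an assumption for illustration, not the contents of the real `wine_#` files) to show two commonly needed parameters, `sep` and `skiprows`.

```python
import io

import pandas as pd

# Hypothetical file contents, NOT one of the real wine_# files:
# one junk line, then tab-separated data with a header row.
raw = (
    "exported from some tool\n"
    "Bottle\tGrape\tAlcohol\n"
    "1\tChardonnay\t14.23\n"
    "2\tShiraz\t14.91\n"
)

demo = pd.read_csv(
    io.StringIO(raw),
    sep="\t",    # the delimiter is a tab rather than a comma
    skiprows=1,  # skip the junk line that precedes the header
)
print(demo)
print(demo.dtypes)
```

Opening each file in a text editor first is usually the quickest way to see which arguments it needs.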

You are provided with tests that use `df.equals()` to check that all the dataframes are identical. If two dataframes look identical but `df.equals()` returns `False`, the cause may be a type mismatch: try checking `df.index`, `df.columns`, or `df.info()`.
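
As a rough illustration of the type issue, here is a minimal sketch with two made-up dataframes (not the homework data) whose values can look the same when displayed but whose `Bottle` columns have different dtypes.

```python
import pandas as pd

# Made-up data for illustration only.
a = pd.DataFrame({"Bottle": [1, 2], "Grape": ["Chardonnay", "Shiraz"]})
b = pd.DataFrame({"Bottle": ["1", "2"], "Grape": ["Chardonnay", "Shiraz"]})

print(a.equals(b))                   # False: integers vs. strings
print(a["Bottle"].dtype, b["Bottle"].dtype)

# Casting the offending column to the expected dtype makes them equal.
b_fixed = b.astype({"Bottle": a["Bottle"].dtype})
print(a.equals(b_fixed))             # True
```

Comparing `a.dtypes` with `b.dtypes` side by side is often enough to spot the column that differs.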

In [11]:
df1 = pd.read_csv("./data/wine_1.csv")
df2 = None
df3 = None
df4 = None
df5 = None
df6 = None

In [12]:
df1

Unnamed: 0,Bottle,Grape,Origin,Alcohol,pH,Colour,Aroma
0,1,Chardonnay,Australia,14.23,3.51,White,Floral
1,2,Pinot Grigio,Italy,13.2,3.3,White,Fruity
2,3,Pinot Blanc,France,13.16,3.16,White,Citrus
3,4,Shiraz,Chile,14.91,3.39,Red,Berry
4,5,Malbec,Argentina,13.83,3.28,Red,Fruity


In [13]:
for i, df in enumerate([df2, df3, df4, df5, df6]):
    assert df1.equals(df), f"df1 not equal to df{i + 2}"
print("All tests passed.")

AssertionError: df1 not equal to df2

In [None]:
df

<br><br>

## Exercise 2: The Titanic dataset

The file *titanic.csv* contains data on 1309 passengers who were on the Titanic's unfortunate voyage. For each passenger, the following data are recorded:

* survived - Survival (0 = No; 1 = Yes)
* pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
* name - Name
* sex - Sex
* age - Age
* sibsp - Number of Siblings/Spouses Aboard
* parch - Number of Parents/Children Aboard
* ticket - Ticket Number
* fare - Passenger Fare
* cabin - Cabin
* embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
* boat - Lifeboat (if survived)
* body - Body number (if did not survive and body was recovered)

In this exercise you will perform a number of wrangling operations to manipulate and extract subsets of the data.

Note: many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.

#### 2(a)
rubric={points:1}

Load the `titanic.csv` dataset into a pandas dataframe named `titanic_df`.

In [None]:
titanic_df = pd.read_csv("data/titanic.csv")
titanic_df.head()

In [None]:
assert set(titanic_df.columns) == set(
    [
        "pclass",
        "survived",
        "name",
        "sex",
        "age",
        "sibsp",
        "parch",
        "ticket",
        "fare",
        "cabin",
        "embarked",
        "boat",
        "body",
        "home.dest",
    ]
), "All required columns are not present"
assert len(titanic_df.index) == 1309, "Wrong number of rows in dataframe"
print("Success")

#### 2(b)
rubric={points:2}

The column names `sibsp` and `parch` are not very descriptive. Use `df.rename()` to rename these columns to `siblings_spouses` and `parents_children` respectively.

In [None]:
titanic_df.rename(
    columns={"sibsp": "siblings_spouses", "parch": "parents_children"},
    inplace=True,
)

In [None]:
assert set(["siblings_spouses", "parents_children"]).issubset(
    titanic_df.columns
), "Column names were not changed properly"
print("Success")

#### 2(c)
rubric={points:2}

We will practice indexing different subsets of the dataframe in the following questions.

Select the column `age` using single bracket notation `[]`. What type of object is returned?

In [None]:
titanic_df["age"]

#### It returns an object of type `Series`.

#### 2(d)
rubric={points:2}

Now select the `age` using double bracket notation `[[]]`. What type of object is returned?

In [None]:
titanic_df[["age"]]

#### It returns an object of type `DataFrame`.
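
The difference between the two notations can be checked directly with `type()`. The sketch below uses a small made-up dataframe (the values are illustrative, not taken from `titanic.csv`).

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"age": [31.0, 45.0, 8.0], "fare": [30.0, 7.25, 15.5]})

single = toy["age"]    # one column label -> one-dimensional Series
double = toy[["age"]]  # list of labels   -> two-dimensional DataFrame

print(type(single).__name__, single.shape)  # Series (3,)
print(type(double).__name__, double.shape)  # DataFrame (3, 1)
```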

#### 2(e)
rubric={points:1}

Select the columns `pclass`, `survived`, and `age` using a single line of code.

In [None]:
titanic_df[["pclass", "survived", "age"]]

#### 2(f)
rubric={points:2}

Use the `iloc` indexer to obtain the first 5 rows of the columns `name`, `sex` and `age` using a single line of code.

In [None]:
titanic_df.iloc[0:5, 2:5]

#### 2(g)
rubric={points:2}

Now use the `loc` indexer to obtain the first 5 rows of the columns `name`, `sex` and `age` using a single line of code.

In [None]:
titanic_df.loc[0:4, "name":"age"]
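
One detail that is easy to miss when comparing `iloc` and `loc`: positional slices exclude the end point, while label slices include it. A minimal sketch on a made-up dataframe with the default integer index (names and values are hypothetical):

```python
import pandas as pd

# Made-up rows for illustration; not taken from titanic.csv.
toy = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cara", "Dev"],
        "sex": ["female", "male", "female", "male"],
        "age": [31.0, 45.0, 8.0, 62.0],
        "fare": [30.0, 7.25, 15.5, 80.0],
    }
)

by_position = toy.iloc[0:2, 0:3]       # rows 0-1, columns 0-2 (end excluded)
by_label = toy.loc[0:1, "name":"age"]  # rows 0-1, name..age (end included)

print(by_position.equals(by_label))  # True
```

This is why the `iloc` version uses `0:5` and the `loc` version uses `0:4` to get the same five rows.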

#### 2(h)
rubric={points:2}

How many passengers survived (`survived = 1`) the disaster? Hint: try using `df.query()` or `[]` notation to subset the dataframe and then `df.shape` to check its size.

In [None]:
survived_df = titanic_df.query("survived == 1")
survived_df.head()

In [None]:
len(survived_df)

#### 2(i)
rubric={points:1}

How many passengers that survived the disaster (`survived = 1`) were over 60 years of age?

In [None]:
len(survived_df.query("age > 60"))
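
For comparison, the same kind of count can be obtained with `df.query()`, with a boolean mask in `[]`, or by summing the mask. The sketch below uses made-up rows (not the Titanic data) and also shows that a missing age never satisfies `age > 60`.

```python
import pandas as pd

# Made-up rows for illustration; one age is missing on purpose.
toy = pd.DataFrame(
    {
        "survived": [1, 0, 1, 1, 0],
        "age": [22.0, 38.0, 64.0, None, 71.0],
    }
)

n_query = len(toy.query("survived == 1"))   # query string
n_mask = len(toy[toy["survived"] == 1])     # boolean mask
n_sum = int((toy["survived"] == 1).sum())   # True counts as 1

# Combine conditions with & and wrap each one in parentheses.
older = toy[(toy["survived"] == 1) & (toy["age"] > 60)]

print(n_query, n_mask, n_sum)  # 3 3 3
print(len(older))              # 1
```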

#### 2(j)
rubric={points:2}

What were the lowest and highest fares paid to board the Titanic? Store your answers as floats in the variables `lowest` and `highest`.

In [None]:
lowest = titanic_df.fare.min()
highest = titanic_df.fare.max()
lowest, highest


#### 2(k)
rubric={points:1}

Sort the dataframe by fare paid (most to least).

In [None]:
sorted_df = titanic_df.sort_values(by="fare", ascending=False)
sorted_df.head()

#### 2(l)
rubric={points:1}

Save the sorted dataframe to a .csv file called 'titanic_fares.csv' using `to_csv()`.

In [None]:
sorted_df.to_csv("titanic_fares.csv")
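
Note that `to_csv()` writes the index as an unnamed first column by default, which shows up as `Unnamed: 0` when the file is read back without `index_col`. A small sketch with a made-up dataframe, using an in-memory buffer instead of a real file:

```python
import io

import pandas as pd

# Made-up values; the index mimics row labels left over after sorting.
toy = pd.DataFrame({"fare": [512.5, 7.25]}, index=[49, 1254])

with_index = io.StringIO()
toy.to_csv(with_index)                # index is written by default
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # drop the index on write
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)   # ['Unnamed: 0', 'fare']
print(cols_no_index)  # ['fare']
```

Whether to keep the index depends on whether the original row labels are needed later; reading with `index_col=0` is another way to restore them.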

#### 2(m)
rubric={points:3}

Create a scatter plot of fare (y-axis) vs. age (x-axis). Make sure to follow the [guidelines on figures](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#figures). You are welcome to use pandas built-in plotting or `matplotlib`. 

In [None]:
plt.scatter(x=titanic_df.age, y=titanic_df.fare)
plt.xlabel("Age")
plt.ylabel("Fare")
plt.title("Fare vs. Age of Titanic Passengers")

#### 2(n)
rubric={points:3}

Create a bar plot of `embarked` values. 

> Make sure to name the axes and give a title to your plot. 

In [None]:
import seaborn as sns


In [None]:
sns.countplot(data=titanic_df, x="embarked")
plt.xlabel("Port of embarkation")
plt.ylabel("Number of passengers")
plt.title("Passengers by Port of Embarkation")

<br><br>

## Exercise 3: Treasure Hunt

In this exercise, we will generate various collections of objects either as a list, a tuple, or a dictionary. Your task is to inspect the objects and look for treasure, which in our case is a particular object: **the character "T"**. 

**Your tasks:**

For each of the following cases, index into the Python object to obtain the "T" (for Treasure). 

> Please do not modify the original line of code that generates `x` (though you are welcome to copy it). You are welcome to answer this question "manually" or by writing code - whatever works for you. However, your submission should always end with a line of code that prints out `'T'` (because you've found it). 

In [14]:
import string

letters = string.ascii_uppercase
letters

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

The first one is done for you as an example.

#### Example question

In [15]:
x = ("nothing", {-i: l for i, l in enumerate(letters)})
x

('nothing',
 {0: 'A',
  -1: 'B',
  -2: 'C',
  -3: 'D',
  -4: 'E',
  -5: 'F',
  -6: 'G',
  -7: 'H',
  -8: 'I',
  -9: 'J',
  -10: 'K',
  -11: 'L',
  -12: 'M',
  -13: 'N',
  -14: 'O',
  -15: 'P',
  -16: 'Q',
  -17: 'R',
  -18: 'S',
  -19: 'T',
  -20: 'U',
  -21: 'V',
  -22: 'W',
  -23: 'X',
  -24: 'Y',
  -25: 'Z'})

**Example answer**:

In [16]:
x[1][-19]

'T'

> Note: In these questions, the goal is not to understand the code itself, which may be confusing. Instead, try to probe the types of the various objects. For example `type(x)` reveals that `x` is a tuple, and `len(x)` reveals that it has two elements. Element 0 just contains "nothing", but element 1 contains more stuff, hence `x[1]`. Then we can again probe `type(x[1])` and see that it's a dictionary. If you `print(x[1])` you'll see that the letter "T" corresponds to the key -19, hence `x[1][-19]`.
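
The probing steps described in the note can be sketched as follows. The key lookup at the end is just one possible way to find the entry and is not required by the exercise.

```python
import string

letters = string.ascii_uppercase
x = ("nothing", {-i: l for i, l in enumerate(letters)})

print(type(x), len(x))  # a tuple with two elements
print(type(x[0]))       # element 0 is a plain string
print(type(x[1]))       # element 1 is a dictionary

# Find the key whose value is "T" instead of reading the printout.
key = [k for k, v in x[1].items() if v == "T"][0]
print(key)              # -19
print(x[1][key])        # T
```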

#### 3(a)
rubric={points:2}

In [17]:
# Do not modify this cell
x = [
    [letters[i] for i in range(26) if i % 2 == 0],
    [letters[i] for i in range(26) if i % 2 == 1],
]
x

[['A', 'C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S', 'U', 'W', 'Y'],
 ['B', 'D', 'F', 'H', 'J', 'L', 'N', 'P', 'R', 'T', 'V', 'X', 'Z']]

In [18]:
x[1][9]

'T'

#### 3(b)
rubric={points:2}

In [21]:
# Do not modify this cell
np.random.seed(1)
x = np.random.choice(list(set(letters) - set("T")), size=(100, 26), replace=True)
x[np.random.randint(100), np.random.randint(26)] = "T"

In [31]:
np.where(x == "T")[0]  # row index of the cell that holds "T"

array([95], dtype=int64)

In [32]:
x[95]  # row 95 is the 1-D array that contains "T"

array(['E', 'V', 'T', 'R', 'Q', 'P', 'O', 'M', 'E', 'C', 'W', 'C', 'C',
       'Q', 'E', 'C', 'G', 'D', 'A', 'W', 'X', 'B', 'L', 'I', 'C', 'K'],
      dtype='<U1')

In [33]:
x[95][2]

'T'
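
As an alternative, both coordinates can be read from `np.where` in one step, since it returns one index array per dimension for a 2-D input. The sketch below copies the generating code from the exercise; the names `rows`, `cols`, `row` and `col` are introduced here for illustration.

```python
import string

import numpy as np

letters = string.ascii_uppercase

# Copied from the exercise cell above.
np.random.seed(1)
x = np.random.choice(list(set(letters) - set("T")), size=(100, 26), replace=True)
x[np.random.randint(100), np.random.randint(26)] = "T"

rows, cols = np.where(x == "T")  # one index array per axis
row, col = int(rows[0]), int(cols[0])
print(row, col)
print(x[row, col])               # T
```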

#### 3(c)
rubric={points:3}

In [36]:
# Do not modify this cell
n = 26
x = dict()
for i in range(n):
    x[string.ascii_lowercase[i]] = {
        string.ascii_lowercase[(j + 1) % n]: [[letters[j]] if j - 2 == i else None]
        for j in range(n)
    }

In [48]:
# "T" is letters[19], stored where j == 19 and i == j - 2 == 17:
# outer key "r", inner key "u", wrapped in two lists.
x["r"]["u"][0][0]

'T'
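
If the keys are not obvious from reading the code, one option is to search the nested dictionary for them and then index into `x` with what the search finds. This is only a sketch; the names `hits`, `outer` and `inner` are introduced here for illustration.

```python
import string

letters = string.ascii_uppercase

# Copied from the exercise cell above.
n = 26
x = dict()
for i in range(n):
    x[string.ascii_lowercase[i]] = {
        string.ascii_lowercase[(j + 1) % n]: [[letters[j]] if j - 2 == i else None]
        for j in range(n)
    }

# Each inner value is a one-element list holding either None or [letter].
hits = [
    (outer, inner)
    for outer, inner_dict in x.items()
    for inner, value in inner_dict.items()
    if value[0] == ["T"]
]
outer, inner = hits[0]
print(outer, inner)          # r u
print(x[outer][inner][0][0])  # T
```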

<br><br><br><br>

![](eva-congrats.png)