# Homework 02: Exploratory Data Analysis and Data Wrangling 

Welcome to Homework 02! In this homework you will practice using functions and methods from the `pandas` library.

**Due Date:** 

**Collaboration Policy:** You are not allowed to discuss this assignment with other students. If you have questions, please direct them to your instructor.

**Notes:**

- We use data from the `classics.csv` file, which is part of the CORGIS Project (The Collection of Really Great, Interesting, Situated Datasets). See the documentation [here](https://corgis-edu.github.io/corgis/csv/classics/).

**Warning:** Try not to delete the instructions or the questions of the assignment.

In the markdown cell below enter your name, section, and the date.

**Name:** 

**Section:** 

**Date:**

Let's get started!

In order to complete the tasks in this assignment you will need to `import` the necessary modules.

**Question 1.** Import the `NumPy`, `pandas` and `re` packages.

**Note:** Be sure to use appropriate aliasing.

In [None]:
import numpy as np
import pandas as pd
import re

In [None]:
classics = pd.read_csv('data/classics.csv')
# Titles of every book by Homer, kept as a one-column DataFrame
classics.loc[classics['bibliography.author.name'] == 'Homer', ['bibliography.title']]

## Exploratory Data Analysis

### What is exploratory data analysis?

>Exploratory data analysis is closely associated with John Tukey, of Princeton University and Bell Labs. Tukey proposed exploratory data analysis in 1961, and wrote a book about it in 1977. Tukey’s interest in exploratory data analysis influenced the development of the S statistical language at Bell Labs, which later led to S-Plus and R. 

**Source:** [Data Wrangling and Exploratory Data Analysis Explained](https://www.infoworld.com/article/3612888/data-wrangling-and-exploratory-data-analysis-explained.html)

**Question 2.** Read in the dataset `classics.csv` and save it to a `pandas` `DataFrame` object.

**Note:** The dataset is located in the data folder.

In [None]:
classics = ...

**Question 2a.** After you load the dataset display the first 10 rows only.

In [None]:
...

**Question 3.** Use the `.columns` attribute to explore the dataset.

In [None]:
...

**Question 3a.** Use the code cell below to write a Python expression that returns a column that contains categorical data as a `Series`.

In [None]:
...

**Question 3b.** Use the code cell below to write a Python expression that returns a column that contains numerical data as a `DataFrame`.
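The difference between Questions 3a and 3b is the return type. As a minimal sketch on a hypothetical toy frame (the `toy` name and its `genre` and `pages` columns are assumptions for illustration, not columns from `classics.csv`), single brackets return a `Series`, while a list of labels inside the brackets returns a `DataFrame`:

```python
import pandas as pd

# Hypothetical toy data; these column names are not from classics.csv
toy = pd.DataFrame({
    "genre": ["epic", "novel", "novel"],
    "pages": [320, 410, 275],
})

as_series = toy["genre"]    # single brackets -> pandas Series
as_frame = toy[["pages"]]   # a list of labels -> pandas DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

The same pattern should carry over to any column of `classics` once you have picked one from the `.columns` output.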

In [None]:
...

**Question 4.** Use the next four code cells to explore the `classics` dataset. After your exploration is complete, use the markdown cell to summarize your exploratory data analysis in paragraph form. Be sure to include comments in the code cells to explain your thought process as you explore.

**Note:** Each code cell must be used during your exploratory data analysis. Make sure that the intent of each cell is different. For example, finding the mean, median, and standard deviation of a column all counts as finding numerical summaries: even though the values represent different measures of central tendency and spread, they fall under the same category. If you are not sure, please contact your instructor.

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

_Type your paragraph here._

**Question 5.** Choose a year between the earliest and latest author birth years in the dataset. Then, write an expression that determines whether each author's birth year is greater than the year you chose.
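Comparing a whole column against a single value is done elementwise in `pandas`. The sketch below uses made-up birth years and a made-up cutoff (`birth_years` and `cutoff` are hypothetical names, and none of the values come from `classics.csv`) to show that the comparison yields a boolean `Series`, which can also serve as a row filter:

```python
import pandas as pd

# Hypothetical birth years, not taken from classics.csv
birth_years = pd.Series([1775, 1812, 1564, 1850], name="birth_year")
cutoff = 1800  # the year chosen for this toy example

# Elementwise comparison: one True/False per entry
is_after_cutoff = birth_years > cutoff
print(is_after_cutoff.tolist())               # [False, True, False, True]

# The boolean Series can be used as a mask to keep matching rows
print(birth_years[is_after_cutoff].tolist())  # [1812, 1850]
```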

In [None]:
...

**Question 6.** Create a new data frame named `classics_100` containing only books written by authors within the past 100 years.

In [None]:
classics_100 = ...

**Question 7.** How many authors are there?

**Note:** Use the code cell below to determine this value programmatically.

In [None]:
...

**Question 7a.** Type your answer (the number of authors) from **Question 7** in the Markdown cell below.

_Type your answer from Question 7 in this cell._

**Question 8.** Who is the most ancient author?

**Note:** Use the code cell below to determine this value programmatically.

In [None]:
...

Type your answer (the author's name) from **Question 8.** in the Markdown cell below.

_Type your answer from Question 8. in this cell._

**Question 8a.** How many books did the most ancient author write?

**Note:** Use the code cell below to determine this value programmatically.

In [None]:
...

**Question 8b.** Type your answer (the number of books) from **Question 8a.** in the Markdown cell below.

_Type your answer from Question 8a. in this cell._

## Data Wrangling

### What is data wrangling?

>Data rarely comes in usable form. It's often contaminated with errors and omissions, rarely has the desired structure, and usually lacks context. Data wrangling is the process of discovering the data, cleaning the data, validating it, structuring it for usability, enriching the content (possibly by adding information from public data such as weather and economic conditions), and in some cases aggregating and transforming the data.

**Source:** [Data Wrangling and Exploratory Data Analysis Explained](https://www.infoworld.com/article/3612888/data-wrangling-and-exploratory-data-analysis-explained.html)

**Question 9.** Create a dictionary named `auth_dict` that maps authors (`key`) to their books (`value`). There are some authors with multiple books, but dictionaries should only have one key for each author. 

**Note:** There are multiple ways to do this.

In [None]:
#auth_dict = ...
# Keep one row per title so a repeated title is not listed twice
unique_titles = classics.drop_duplicates(subset=['bibliography.title'], keep='last').reset_index(drop=True)
auth_dict = {}
for author, title in zip(unique_titles['bibliography.author.name'], unique_titles['bibliography.title']):
    # setdefault creates the author's list the first time the author appears
    auth_dict.setdefault(author, []).append(title)

# Preview the first nine authors and their titles
for count, (author, titles) in enumerate(auth_dict.items()):
    if count == 9:
        break
    print(f'{author}: {titles}')

**Question 10.** The column `bibliography.subjects` contains the subjects for each story. Some of these are missing, some are separated by semi-colons, and some are separated by commas. Because of this, create two expressions: one that turns a comma-separated string into a list, and one that turns a semi-colon-separated string into a list.

**Hint:** You may use the strings below. We will return to these!
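As a rough illustration of the idea in Question 10, the sketch below splits two hypothetical strings (written here to mimic the two formats, not copied from the dataset). It shows two approaches you might consider: `str.split` followed by `strip`, and `re.split` with a pattern that absorbs whitespace around the separator.

```python
import re

# Hypothetical subject strings that mimic the two formats described above
comma_example = "Sisters -- Fiction, Courtship -- Fiction,Love stories"
semicolon_example = "Epic poetry; Greek ;Mythology"

# Approach 1: split on the separator, then strip stray whitespace from each part
comma_list = [part.strip() for part in comma_example.split(",")]

# Approach 2: let the regex consume optional whitespace around the separator
semicolon_list = re.split(r"\s*;\s*", semicolon_example)

print(comma_list)      # ['Sisters -- Fiction', 'Courtship -- Fiction', 'Love stories']
print(semicolon_list)  # ['Epic poetry', 'Greek', 'Mythology']
```

Either approach works for either separator; which one reads better is a matter of taste.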

In [None]:
# Run this cell to see an example
# of the comma-separated string
classics['bibliography.subjects'][0]

**Question 10a.** Put your expression that turns a comma-separated string into a list in the code cell below.

In [None]:
import re
comma_to_list = r'\s*,\s*'
re.split(comma_to_list, classics['bibliography.subjects'][0])

In [None]:
import numpy as np

title_subj = {}
for i in range(len(classics)):
    if pd.isna(classics["bibliography.subjects"][i]):
        continue
    title_subj[classics["bibliography.title"][i]] = classics["bibliography.subjects"][i].split(",")
    
title_subj['Pride and Prejudice']

In [None]:
# Run this cell to see an example
# of the semi-colon-separated string
classics['bibliography.subjects'][3]

**Question 10b.** Put your expression that turns a semi-colon-separated string into a list in the code cell below.

In [None]:
...

**Question 11.** Using your answer to **Question 10**, a `for` loop, and conditional statements, create a new dictionary whose keys are book titles and whose values are the lists of subjects. Save this to a dictionary object named `title_subj`.

**Notes:**

* Iterating through rows is not great, but we do it here in light of what we know to date. If you prefer to use another method, like writing a function, you are free to do so.

* If the subject is null, replace it with `["About nothing"]`.

In [None]:
title_subj = ...

...

**Question 12.** Using the dictionary `title_subj`, find all unique subjects and save them as a list named `subjects`. Use sets to show that the entries are unique.

In [None]:
subjects = set()

for title in title_subj:
    for subject in title_subj[title]:
        subjects.add(subject)
subjects = list(subjects)
len(subjects)

In [None]:
title_subj = {}
for title, subj in zip(classics['bibliography.title'], classics['bibliography.subjects']):
    if pd.isna(subj):
        # Missing subjects get the placeholder required by Question 11
        title_subj[title] = ["About nothing"]
    else:
        # Store a list rather than the raw string so later cells can iterate over subjects
        title_subj[title] = [s.strip() for s in subj.split(",")]

In [None]:
subjects = []
for i in title_subj.values():
    subjects.extend(i)
subjects = list(set(subjects))

unique_subjects = set(subjects)
if len(unique_subjects) == len(subjects):
    print("The list contains only unique entries.")
else:
    print("The list contains duplicate entries.")

In [None]:
subject = input("What subject do you like: ").lower()
titles = []
for title, subj_list in title_subj.items():
    # Keep the title if any of its subjects contains the search text (case-insensitive)
    if any(subject in genre.lower() for entry in subj_list for genre in entry.split(',')):
        titles.append(title)
np.unique(titles)

In [None]:
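# Avoid naming a variable `list`: it shadows the built-in and breaks later calls to list()
matching_titles = []
subject = "Mentally ill women -- Fiction"
for title in title_subj:
    # Exact match against one of the title's subject strings
    if subject in title_subj[title]:
        matching_titles.append(title)
matching_titles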

In [None]:
# Flatten every title's subject list into one list (duplicates included)
all_subj = []
for subj in title_subj.values():
    all_subj += subj
print(len(all_subj))

# np.unique returns the sorted distinct values
uniq_subj = np.unique(np.array(all_subj)).tolist()
len(uniq_subj)

**Question 13.** Using a combination of for loops and string methods, return a list of all titles from your dictionary that are about a subject you chose. 

**Note:** Alternatively, you can use regex and a list comprehension, but this is not a requirement.

In [None]:
...

## Finding Books by Category

In this section, you will help us to find books with an easier reading level. The data set contains a variety of variables on the difficulty of books. To see all the difficulty measures you can look at the column names.

**Question 14.** Choose two difficulty measures. Find the mean, standard deviation, minimum, median, and maximum values.
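One way to collect several summary statistics at once is `DataFrame.agg` with a list of statistic names. The sketch below runs on made-up scores; `toy`, `measure_a`, and `measure_b` are placeholder names, not difficulty columns from `classics.csv`.

```python
import pandas as pd

# Hypothetical difficulty scores; the column names are placeholders
toy = pd.DataFrame({
    "measure_a": [2.0, 4.0, 6.0, 8.0],
    "measure_b": [10.0, 20.0, 30.0, 100.0],
})

# One row per statistic, one column per measure
summary = toy[["measure_a", "measure_b"]].agg(["mean", "std", "min", "median", "max"])
print(summary)
```

Note that `std` here is the sample standard deviation (pandas divides by `n - 1` by default).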

In [None]:
...

In [None]:
...

Feature scaling (also known as data normalization) is a method used to standardize the range of features of data. Since the range of values may vary widely across features, scaling is often a necessary step in data pre-processing. We often normalize variables when an analysis involves multiple variables measured on different scales and we want each variable to have the same range. This prevents one variable from being overly influential, especially if it is measured in different units (e.g., if one variable is measured in miles and another in feet).

In scaling, you transform the data such that the features are within a specific range e.g. $[0, 1]$. To normalize the values in a data set to be between 0 and 1, you can use the formula $$z_i=\frac{x_i-\text{min}(x)}{\text{max}(x)-\text{min}(x)}$$ or $$\text{Scaled Value}=\frac{\text{Value}-\text{Minimum of all Values}}{\text{Maximum of all Values}-\text{Minimum of all Values}}$$ 

In other words, subtract the minimum value from each value, then divide by the difference between the maximum and minimum values.
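The formula above can be sketched in a few lines of `pandas`. This is only an illustration on made-up numbers: the `toy` frame, its column names, and the helper `min_max_scale` are all hypothetical and not part of the assignment's dataset or of any library.

```python
import pandas as pd

# Hypothetical scores; "measure_a" and "measure_b" are placeholder names
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "measure_a": [10.0, 15.0, 20.0],
    "measure_b": [1.0, 4.0, 2.0],
})

def min_max_scale(column: pd.Series) -> pd.Series:
    """Apply z_i = (x_i - min(x)) / (max(x) - min(x)) to one column."""
    return (column - column.min()) / (column.max() - column.min())

# Start from the identifying column, then add one scaled column per measure
scaled = toy[["title"]].copy()
for name in ["measure_a", "measure_b"]:
    scaled[name] = min_max_scale(toy[name])

print(scaled["measure_a"].tolist())  # [0.0, 0.5, 1.0]
```

After scaling, the smallest value in each column maps to 0 and the largest to 1. If a column is constant, the denominator is zero and the result is `NaN`, so that case would need separate handling.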

**Question 15.** Apply the scaling method explained above to the difficulty measures selected in **Question 14**. Scale each measure to be between 0 and 1. Save your results to a dataframe with a column for book name, author, and each difficulty measure score (two columns).

In [None]:
...

In [None]:
...

**Question 16.** Based on the sum of scaled difficulty measures you chose in **Question 15**, which book is the easiest to read? Do you agree with this? Why or why not?

**Note:** Add the sum of the difficulty measures to the dataframe you created in **Question 15**. Then, in the markdown cell below the code cell, type your response to the questions *"Do you agree with this? Why or why not?"*.

In [None]:
...

_Type your response here._

## Submission

Make sure you have run all cells in your notebook in order so that all images/graphs appear in the output. **Please save before exporting!**

After running all the cells in the notebook and **saving**, download the notebook (`.ipynb`) by right-clicking on the file name and selecting **Download**. You'll submit this `.ipynb` file to Gradescope, through the assignment in Canvas, for grading.