# Lab 2 (Due by 1:59 am via Canvas/Gradescope)

Your Name: Tejadatta Kalapatapu

Due: Saturday, Oct. 5 @ 1:59 am

### Submission Instructions
Submit this `ipynb` file to Gradescope (this can also be done via the assignment on Canvas). To ensure that your submitted `ipynb` file represents your latest code, make sure to run a fresh `Kernel > Restart & Run All` just before uploading the `ipynb` file to Gradescope. **In addition:**
- Make sure your name is entered above
- Make sure you comment your code effectively
- If problems are difficult for the TAs/Profs to grade, you will lose points

### Tips for success
- Collaborate: bounce ideas off each other. If you are having trouble, you can ask your classmates or Dr. Singhal for help with specific issues, however...
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](http://www.northeastern.edu/osccr/academic-integrity), i.e. you are welcome to **talk about** (*not* show each other your answers to) the problems.

In [1]:
# you might use the below modules on this lab
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## Part 1: Understanding Cleaning
### Part 1.1: Grabbing Data and Preliminary Cleaning (10 points)

We wish to create a data frame that includes all the spells for each class (a "class" is something like a "wizard", or a "bard") in Dungeons and Dragons 5th Edition, which you can find [here](http://dnd5e.wikidot.com/). Your final data frame should look something like:

| Class     | Level     | Spell Name    | School      | Casting Time | Range                | Duration      | Components |
|----------:|----------:|--------------:|------------:|-------------:|---------------------:|--------------:|-----------:|
| Artificer | Level 0   | Acid Splash   | Conjuration | 1 Action     | 60 Feet              | Instantaneous | V, S       |
| Artificer | Level 0   | Booming Blade | Evocation   | 1 Action     | Self (5-foot radius) | 1 Round       | S, M       |
| ...       | ...       | ...           | ...         | ...          | ...                  | ...           | ...        |
| Wizard    | Level 9   | Wish          | Conjuration | 1 Action     | Self                 | Instantaneous | V          |

Below are two functions which:
- takes a class (string) as an argument and returns the tables from the class's DND wiki spell page in a dictionary for each spell level
- takes a list of classes, applies the first function to each of them, then combines all the tables into a data frame, including a column with class name and a column with spell level

**DO NOT CHANGE ANYTHING IN THE BODY OF THE FUNCTIONS.**

**In a markdown cell** create a bullet point list where you explain what each chunk of code does. Your bullet point list should have **FIVE** bullet points/explanations corresponding to the five chunks below the `# EXPLAIN THIS (number)` comments. You must accurately summarize the content and procedure of each code chunk. **Talking to your neighbors/group about this is highly recommended.**

In [None]:
def get_class_spell_dict(dnd_class):
    """ takes a D&D class (string) and gets the spell tables and saves them in a dictionary
    
    Args:
        dnd_class (str): the D&D class
        
    Returns:
        table_dict (dict): a dictionary of tables, one for each spell level
    """

    # EXPLAIN THIS (1)
    url = f'http://dnd5e.wikidot.com/spells:{dnd_class}'
    tables = pd.read_html(url)
    table_dict = {}
    for i in range(len(tables)):
        table_dict[f'Level {i}'] = tables[i]

    return table_dict

def get_full_spell_df(class_list):
    """ takes a list of D&D classes (list of strings), applies the get_class_spell_dict() function to them, and then combines them into a data frame

    Args:
        class_list (list): a list of strings

    Returns:
        spells_df (data frame): a data frame with all the spells
    """

    spells_df = pd.DataFrame()
    level_list = []
    long_class_list = []
    
    # EXPLAIN THIS (2)
    for class_ in class_list:
        class_dict = get_class_spell_dict(class_)
        class_df = pd.DataFrame()

        # EXPLAIN THIS (3)
        for level in class_dict:
            level_list.append([level] * len(class_dict[level]))
            class_df = pd.concat([class_df, class_dict[level]])

        # EXPLAIN THIS (4)
        long_class_list.append([class_] * len(class_df))
        spells_df = pd.concat([spells_df, class_df])

    # EXPLAIN THIS (5)
    spells_df.insert(0, 'Level', [item for sublist in level_list for item in sublist])
    spells_df.insert(0, 'Class', [item for sublist in long_class_list for item in sublist])
    
    return spells_df

class_list = ['Artificer', 'Bard', 'Cleric', 'Druid', 'Paladin', 'Ranger', 'Sorcerer', 'Warlock', 'Wizard']
notclean_df = get_full_spell_df(class_list)
notclean_df

Your answers here:

- The URL of the spell page for the given D&D class is built from the class name, and `pd.read_html` reads every HTML table on that page into a list of data frames, one per spell level. An empty dictionary is then created, and the for loop stores each table under a key of the form "Level 0", "Level 1", and so on, based on its position in the list.
- The for loop goes through every class in "class_list". For each class, "get_class_spell_dict" is called to get the dictionary of spell tables ("class_dict"), and an empty data frame "class_df" is created to collect that class's spells.
- The inner for loop goes through each level in "class_dict". It appends to "level_list" a list that repeats the level label once for every row in that level's table, and then concatenates that level's table onto "class_df".
- Once all levels of a class are processed, a list that repeats the class name once for every row in "class_df" is appended to "long_class_list", and "class_df" is concatenated onto "spells_df".
- "level_list" and "long_class_list" are lists of lists, so each is flattened into a single list with a nested list comprehension. The flattened lists are then inserted at the front of "spells_df" as the "Level" and "Class" columns, leaving "Class" first and "Level" second.
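
The bookkeeping in chunks (3) to (5) can be illustrated without the scraping step. The sketch below uses made-up row counts (the `fake_tables` dictionary is hypothetical, not real wiki data) to show how `[level] * n` builds one label per row and how the nested list comprehension flattens the list of lists into a single list.

```python
# Hypothetical row counts standing in for the scraped tables:
# each value is the number of spells in that level's table.
fake_tables = {"Level 0": 2, "Level 1": 3}

level_list = []
for level, n_rows in fake_tables.items():
    # one copy of the level label per row in that table
    level_list.append([level] * n_rows)

# level_list is a list of lists; flatten it into one flat list
flat_levels = [item for sublist in level_list for item in sublist]

print(level_list)
print(flat_levels)
```

The flattened list has exactly one entry per spell row, which is why it can be inserted as a column.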

### Part 1.2: More Cleaning (15 points)

The "final" data frame from the previous part is still not as clean as it could be. In a markdown cell, perform these two tasks:

1. Write a short paragraph (at least four sentences) discussing what else you would do to continue cleaning up the data
2. Think about the `Components` column specifically, write out some pseudo code (you can see how I did the below example by double clicking on this cell) that roughly outlines how you would go about cleaning that column

```
def my_cleaning_func(column):
    """ this function cleans a column from a data frame

    Args: column (Series)

    Returns: clean_column (Series)
    """

    # take the column
    # clean the column (I have written comments for these steps, YOU SHOULD WRITE PSEUDO-CODE)
    # save it as clean_column

    return clean_column
```

Your answers here:

- To continue cleaning up the data, I would first remove all duplicate rows. Then I would convert each column to its proper data type. I would also create separate columns for the components for further clarity and ease of access. Finally, I would standardize the values in the "Duration" and "Range" columns, because they are currently free-form strings whose inconsistent wording makes them hard to manipulate.
-

```
def clean_components(column):
    """ This function cleans the Components column from the spells dataframe 
    
    Args: column (Series) of strings such as 'V, S, M'
    
    Returns: clean_column (DataFrame) with one True/False column per component
    
    """

    #empty lists for each component flag
    v = []
    s = []
    m = []

    #going through all the entries
    for entry in column:
        #splitting the components into a list of letters, e.g. ['V', 'S']
        c = entry.upper().split(', ')
        
        #check if the component letter exists in that spell and append the result
        v.append('V' in c)
        s.append('S' in c)
        m.append('M' in c)
    
    #new cleaned df, one boolean column per component
    clean_column = pd.DataFrame({
        'Verbal': v,
        'Somatic': s,
        'Material': m
    })

    return clean_column
```
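
As a possible shortcut for the same idea, pandas can split a delimited string column into indicator columns with `Series.str.get_dummies`. The sketch below runs on a small made-up Series (the values are illustrative, and it assumes components are separated by `', '`); the real column may contain extra text that would need stripping first.

```python
import pandas as pd

# Made-up sample of a Components column (illustrative values only)
components = pd.Series(["V, S", "S, M", "V", "V, S, M"])

# Split on ', ' and build one indicator column per component letter
flags = components.str.get_dummies(sep=", ").astype(bool)

# Rename the single-letter columns to descriptive names
flags = flags.rename(columns={"V": "Verbal", "S": "Somatic", "M": "Material"})

print(flags)
```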

# Part 2: Summarizing and Visualizing Data

This problem uses `evdataset.csv`, available in the Labs Module on Canvas, which was taken and adapted from Kaggle (no longer hosted) and contains a sample of 194 electric vehicles on the market until 2022. The full dataset includes basic technical specifications, battery capacity and range in various weather and road conditions.

In [None]:
df_ev = pd.read_csv('evdataset.csv', index_col='id')
df_ev.head()

## Part 2.1: Numeric Summaries (25 points)

On your own or with a classmate, discuss which features you think would be most interesting to compare across different drives. Pick two or three of them and, after using `.groupby()` to group by the `drive` feature, calculate for all of them:

- means
- medians
- standard deviations

Then, using the original data set, look at the pairwise correlations (with the correlation matrix, check out the [`DataFrame.corr()` documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)). Finally, **in a markdown cell** discuss your key takeaways from the numeric summaries you calculated, and what the correlations were between your chosen features. Were they among the strongest/weakest relationships? Do you think the type of drive may impact these relationships? Any other interesting results of note?


In [None]:
#creating a group of "electricrange", "totalpower", and "batterycapacity"
group = df_ev.groupby("drive")[["electricrange", "totalpower", "batterycapacity"]]

#getting the mean, median, and standard deviation
means = group.mean()
med = group.median()
stddev = group.std()

print("Means:\n", means)
print("\nMedians:\n", med)
print("\nStandard Deviations:\n", stddev)

#getting the correlation matrix
corr_mat = df_ev[["electricrange", "totalpower", "batterycapacity"]].corr()
print("\nCorrelation Matrix:\n", corr_mat)

One key takeaway from this data is that there is a strong positive correlation between electric range and battery capacity, which makes sense because a larger battery stores more energy and so supports a longer range. On the other hand, there is only a weak correlation between electric range and total power, meaning higher power does not go along with a higher range. Furthermore, AWD cars tend to have higher total power but lower electric range than rear-wheel-drive cars.
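
The three summary calls in the cell above can also be collapsed into a single `.agg()` call. The sketch below uses a tiny made-up frame (the numbers are invented for illustration and are not taken from `evdataset.csv`) that borrows two column names from the lab data.

```python
import pandas as pd

# Tiny invented data set; values are NOT from evdataset.csv
toy = pd.DataFrame({
    "drive": ["AWD", "AWD", "Front", "Front", "Rear", "Rear"],
    "electricrange": [400, 500, 250, 350, 300, 500],
    "batterycapacity": [80, 100, 40, 60, 50, 90],
})

# mean, median, and standard deviation per drive type in one table
summary = toy.groupby("drive")[["electricrange", "batterycapacity"]].agg(
    ["mean", "median", "std"]
)
print(summary)

# pairwise Pearson correlations on the ungrouped data
print(toy[["electricrange", "batterycapacity"]].corr())
```

The result has a two-level column index (feature, statistic), which keeps all three summaries side by side for each drive type.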

## Part 2.2: Visual Summaries (25 points)

Again choose two or three features (they can be the same or different as those from the previous part) and make a few plots to further your understanding of the data. For the first two plots, you may use any of `matplotlib`, `seaborn` or `plotly` (you may find some easier to use than others). Please make:

- Histograms for each drive type (i.e. three histograms, one for each of: AWD, Front, Rear) for one of your chosen features. You may make them separately or within a subplot.
- A scatterplot of two of your features, with points colored by drive type.
- Check out the [seaborn plot options again](https://seaborn.pydata.org/examples/index.html) and pick one to use with your chosen features (exercise some thought as to what you are hoping the plot will communicate; you may find it worthwhile to discuss options with your classmates).

Then, **in a markdown cell** discuss what you learned from the plots you created. If you used the same features that you investigated numerically, did the plots corroborate your findings? Or did they provide new insight? If you used new features, what do the plots tell you about what the numeric relationship(s) between the features might be? Any other interesting results to note?

In [None]:
#importing necessary packages
import matplotlib.pyplot as plt
import seaborn as sns

#creating the histograms using for loop for all types of drive
plt.figure(figsize=(15, 5))
for i, drive in enumerate(["AWD", "Front", "Rear"]):
    plt.subplot(1, 3, i + 1)
    df_ev[df_ev["drive"] == drive]["electricrange"].hist()
    plt.title(f"{drive} Electric Range")
plt.show()

#creating the scatterplot
plt.figure(figsize=(10, 5))
sns.scatterplot(data = df_ev, x = "batterycapacity", y = "electricrange", hue = "drive")
plt.title("Battery Capacity vs. Electric Range")
plt.show()

#creating the sns plot
plt.figure(figsize=(10, 5))
sns.boxplot(data = df_ev, x = "drive", y = "totalpower")
plt.title("Total Power by Drive Type")
plt.show()

The histograms show that rear-wheel-drive cars cover a broader spread of electric ranges, while front-wheel-drive cars are concentrated in a narrower band. The scatterplot corroborates the strong positive correlation between electric range and battery capacity noted earlier, and it shows that AWD cars have bigger battery capacities on average. The boxplots show that AWD cars have the highest typical total power, while front-wheel-drive cars have the lowest.

## Part 2.3: Future Considerations (25 points)

1) Explicitly calculate the variance of all the numeric features in the raw `df_ev` data set, as well as the covariance matrix. 

Then, in a few sentences (**in a markdown cell**) discuss in detail:

(a) why some variances are larger than others, 

(b) why the covariances between the different features are not as useful as the correlations you calculated in Part 2.1 (**pick a couple** of example relationships to illustrate the point(s) you make), 

and (c) if the relationships we see between the features based on the correlation matrix from Part 2.1 are necessarily the true relationships between those features. Think about the meme that was shown in the class:

![d](https://miro.medium.com/v2/resize:fit:547/1*2BnD3YAUBGNutkKiG5dKfg.jpeg)

In [None]:
#using numbers only
numeric_columns = df_ev.select_dtypes(include=[np.number]).columns

#variances
variances = df_ev[numeric_columns].var()
print("Variances:\n", variances)

#covariance matrix
cov_matrix = df_ev[numeric_columns].cov()
print("Covariance Matrix:\n", cov_matrix)

a. Some variances are larger than others because the features are measured in different units and on different scales, such as length in millimeters and acceleration in seconds. Furthermore, battery capacity and total power also vary heavily because the cars range from big batteries/motors to small batteries/motors.

b. Covariances are not as useful as the correlations I calculated before because they aren't standardized: their size depends on the units each feature is measured in. For example, the covariance between battery capacity and electric range is positive, but its magnitude is hard to interpret or to compare with the covariance of another pair, such as electric range and total power, because the units of measurement are different.

c. The relationships we see between the features based on the correlation matrix from Part 2.1 aren't necessarily the true relationships, because correlation does not imply causation. Two unrelated variables might show a positive correlation without having anything to do with each other, as in the meme in the picture, which compares sales to shaved heads: the two are completely unrelated but still have a strong positive correlation.
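
The point in (b) can be checked numerically: rescaling a feature changes its covariance with another feature but leaves the correlation untouched, since correlation divides the covariance by both standard deviations. A minimal sketch on invented numbers (the column names `range_km`, `range_m`, and `capacity_kwh` are hypothetical and the values are not from `evdataset.csv`):

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    "range_km": [250.0, 300.0, 400.0, 450.0, 520.0],
    "capacity_kwh": [40.0, 50.0, 70.0, 75.0, 95.0],
})

# Same range expressed in meters instead of kilometers
toy["range_m"] = toy["range_km"] * 1000

cov_km = toy["range_km"].cov(toy["capacity_kwh"])
cov_m = toy["range_m"].cov(toy["capacity_kwh"])
corr_km = toy["range_km"].corr(toy["capacity_kwh"])
corr_m = toy["range_m"].corr(toy["capacity_kwh"])

# covariance scales with the unit change; correlation does not
print("covariance (km):", cov_km)
print("covariance (m): ", cov_m)
print("correlation (km):", corr_km)
print("correlation (m): ", corr_m)
```

Changing kilometers to meters multiplies the covariance by 1000, while the correlation stays the same, which is why correlations are easier to compare across feature pairs.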