Author: Nicholas Orgel

Creation Date: 02.01.2023

# Task
Clean the files & combine them into one final DataFrame.

- This dataframe should have the following columns:
    - Hero (Just the name of the Hero)
    - Publisher
    - Gender
    - Eye Color
    - Race
    - Hair Color
    - Height (numeric)
    - Skin Color
    - Alignment
    - Weight (numeric)
    
    - **Plus, one-hot-encoded columns for every power that appears in the dataset. E.g.:**
        - Agility
        - Flight
        - Superspeed
        - etc.
        
---

**Hint: The measurement strings have a space between the number and the unit, e.g. "100 kg" or "52.5 cm".**
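
One way to act on this hint is to split each value on the space and keep only the numeric part. The snippet below is a minimal sketch on a two-value sample taken from the hint. The helper name `strip_unit` is hypothetical and not part of the notebook.

```python
import pandas as pd


def strip_unit(values: pd.Series) -> pd.Series:
    """Keep the text before the first space and convert it to a number.

    Hypothetical helper for illustration only.
    """
    number_part = values.str.split(' ', n=1).str[0]
    # errors='coerce' turns anything unparseable into NaN instead of raising
    return pd.to_numeric(number_part, errors='coerce')


# The two example strings from the hint
sample = pd.Series(['100 kg', '52.5 cm'])
numeric = strip_unit(sample)
print(numeric.tolist())
```

Using `errors='coerce'` is a design choice: malformed entries become missing values that can be inspected later rather than stopping the conversion.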

## Questions To Answer

**II: Use your combined DataFrame to answer the following questions.**

1. Compare the average weight of superheroes who have Super Speed to those who do not.
2. What is the average height of heroes for each publisher?
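
Once the combined DataFrame exists, both questions can probably be answered with a `groupby`. The sketch below uses a made-up toy frame (`combined_demo`, with invented heroes, publishers and numbers) purely to show the shape of the calls. It assumes the final frame has numeric `Height` and `Weight` columns and a 0/1 `Super Speed` column.

```python
import pandas as pd

# Made-up toy rows standing in for the combined DataFrame (not real data)
combined_demo = pd.DataFrame({
    'Hero': ['Hero A', 'Hero B', 'Hero C', 'Hero D'],
    'Publisher': ['Publisher X', 'Publisher X', 'Publisher Y', 'Publisher Y'],
    'Height': [180.0, 200.0, 170.0, 190.0],
    'Weight': [80.0, 100.0, 60.0, 120.0],
    'Super Speed': [1, 0, 1, 0],
})

# Question 1: mean weight of heroes with (1) and without (0) Super Speed
weight_by_speed = combined_demo.groupby('Super Speed')['Weight'].mean()

# Question 2: mean height per publisher
height_by_publisher = combined_demo.groupby('Publisher')['Height'].mean()

print(weight_by_speed)
print(height_by_publisher)
```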

---

# Steps

## Import Libraries

In [47]:
import pymysql
pymysql.install_as_MySQLdb()
import pandas as pd

#sklearn
from sklearn.preprocessing import OneHotEncoder

## Load DataFrames

In [4]:
hero_info = pd.read_csv('Data/superhero_info.csv')
hero_powers = pd.read_csv('Data/superhero_powers.csv')

In [5]:
# Preview the first five rows of hero_info
hero_info.head()

|   | Hero\|Publisher | Gender | Race | Alignment | Hair color | Eye color | Skin color | Measurements |
|---|---|---|---|---|---|---|---|---|
| 0 | A-Bomb\|Marvel Comics | Male | Human | good | No Hair | yellow | Unknown | {'Height': '203.0 cm', 'Weight': '441.0 kg'} |
| 1 | Abe Sapien\|Dark Horse Comics | Male | Icthyo Sapien | good | No Hair | blue | blue | {'Height': '191.0 cm', 'Weight': '65.0 kg'} |
| 2 | Abin Sur\|DC Comics | Male | Ungaran | good | No Hair | blue | red | {'Height': '185.0 cm', 'Weight': '90.0 kg'} |
| 3 | Abomination\|Marvel Comics | Male | Human / Radiation | bad | No Hair | green | Unknown | {'Height': '203.0 cm', 'Weight': '441.0 kg'} |
| 4 | Absorbing Man\|Marvel Comics | Male | Human | bad | No Hair | blue | Unknown | {'Height': '193.0 cm', 'Weight': '122.0 kg'} |


In [48]:
hero_info.dtypes

Hero|Publisher    object
Gender            object
Race              object
Alignment         object
Hair color        object
Eye color         object
Skin color        object
Measurements      object
dtype: object

## hero_info

---
In the **hero_info** dataset, the **Hero|Publisher** column needs to be split into separate **Hero** and **Publisher** columns, and the **Measurements** column needs to be split into **Height** and **Weight**.
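
A possible way to do both splits is sketched below on two rows copied from the preview above. It assumes every **Measurements** value is a string shaped like a Python dict, which `ast.literal_eval` can parse. The name `hero_info_demo` is used only for this illustration.

```python
import ast

import pandas as pd

# Two rows copied from the hero_info preview
hero_info_demo = pd.DataFrame({
    'Hero|Publisher': ['A-Bomb|Marvel Comics', 'Abe Sapien|Dark Horse Comics'],
    'Measurements': [
        "{'Height': '203.0 cm', 'Weight': '441.0 kg'}",
        "{'Height': '191.0 cm', 'Weight': '65.0 kg'}",
    ],
})

# Split 'Hero|Publisher' on the pipe into two new columns
hero_info_demo[['Hero', 'Publisher']] = (
    hero_info_demo['Hero|Publisher'].str.split('|', expand=True)
)

# Parse each dict-like string into a real dict, then expand the dicts
# into a frame with 'Height' and 'Weight' columns
parsed = hero_info_demo['Measurements'].apply(ast.literal_eval)
measurements = pd.DataFrame(parsed.tolist(), index=hero_info_demo.index)

# Drop the unit after the space and convert the remainder to float
for col in ['Height', 'Weight']:
    hero_info_demo[col] = measurements[col].str.split(' ').str[0].astype(float)

# Remove the original combined columns
hero_info_demo = hero_info_demo.drop(columns=['Hero|Publisher', 'Measurements'])
print(hero_info_demo)
```

`ast.literal_eval` is chosen here over `eval` because it only accepts literals, so it cannot execute arbitrary code hidden in a cell.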

### Data cleaning hero_info

### Check for duplicates

In [18]:
# Check for duplicate values
hero_info.duplicated().sum()

# There is no duplicated data in hero_info

0

### Check for missing values

In [12]:
# Check for missing values
hero_info.isna().sum()

# There is no missing data in hero_info

Hero|Publisher    0
Gender            0
Race              0
Alignment         0
Hair color        0
Eye color         0
Skin color        0
Measurements      0
dtype: int64

---

## hero_powers

---
The **hero_powers** dataset has two columns that need to change:

- **hero_names** needs to be renamed to **Hero** so it can be combined with the **Hero** column split out of **Hero|Publisher**.
- **Powers** needs to be ***one-hot-encoded***.
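
As a rough illustration of the encoding step, the sketch below uses pandas' `Series.str.get_dummies` on two rows copied from the preview. This is a swapped-in alternative chosen for brevity: the notebook imports scikit-learn's `OneHotEncoder`, but this section does not show how it is applied. The names `hero_powers_demo`, `power_flags` and `encoded` are illustrative.

```python
import pandas as pd

# Two rows copied from the hero_powers preview (only untruncated values)
hero_powers_demo = pd.DataFrame({
    'hero_names': ['3-D Man', 'Abin Sur'],
    'Powers': [
        'Agility,Super Strength,Stamina,Super Speed',
        'Lantern Power Ring',
    ],
})

# Split each string on commas and build one 0/1 column per distinct power
power_flags = hero_powers_demo['Powers'].str.get_dummies(sep=',')

# Put the hero name back next to the encoded columns
encoded = pd.concat([hero_powers_demo[['hero_names']], power_flags], axis=1)
print(encoded)
```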

In [22]:
hero_powers.head()

|   | hero_names | Powers |
|---|---|---|
| 0 | 3-D Man | Agility,Super Strength,Stamina,Super Speed |
| 1 | A-Bomb | Accelerated Healing,Durability,Longevity,Super... |
| 2 | Abe Sapien | Agility,Accelerated Healing,Cold Resistance,Du... |
| 3 | Abin Sur | Lantern Power Ring |
| 4 | Abomination | Accelerated Healing,Intelligence,Super Strengt... |


### Check for duplicates

In [17]:
# Check for duplicates
hero_powers.duplicated().sum()

# There is no duplicated data in hero_powers

0

### Check for missing values

In [16]:
# Check for missing data
hero_powers.isna().sum()

# There is no missing data in hero_powers

hero_names    0
Powers        0
dtype: int64

### Change column name hero_names

In [45]:
# Change column name 'hero_names' to 'Hero'
hero_powers = hero_powers.rename(columns={'hero_names': 'Hero'})

In [46]:
hero_powers.head()

|   | Hero | Powers |
|---|---|---|
| 0 | 3-D Man | Agility,Super Strength,Stamina,Super Speed |
| 1 | A-Bomb | Accelerated Healing,Durability,Longevity,Super... |
| 2 | Abe Sapien | Agility,Accelerated Healing,Cold Resistance,Du... |
| 3 | Abin Sur | Lantern Power Ring |
| 4 | Abomination | Accelerated Healing,Intelligence,Super Strengt... |
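
With the key column named **Hero** in both frames, combining them could look like the sketch below. It uses small stand-in frames built from rows shown in the previews and assumes an inner join, which keeps only heroes present in both files. Whether an inner join is the right choice for the final DataFrame is an assumption, not something this notebook states.

```python
import pandas as pd

# Stand-in for hero_info after 'Hero|Publisher' has been split
info_demo = pd.DataFrame({
    'Hero': ['A-Bomb', 'Abin Sur'],
    'Publisher': ['Marvel Comics', 'DC Comics'],
})

# Stand-in for hero_powers after the rename (only untruncated rows)
powers_demo = pd.DataFrame({
    'Hero': ['3-D Man', 'Abin Sur'],
    'Powers': [
        'Agility,Super Strength,Stamina,Super Speed',
        'Lantern Power Ring',
    ],
})

# Inner join on the shared key: rows without a match on both sides are dropped
merged = info_demo.merge(powers_demo, on='Hero', how='inner')
print(merged)
```

In this toy example only Abin Sur appears in both frames, so the result has a single row. Passing `how='left'` instead would keep every row of `info_demo` and leave `Powers` missing where no match exists.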
