# Test 1 - Part 3

## Cleaning Comic Data

The file `./data/Comic_Data_Messy.csv` contains data on comic book characters scraped from a wiki. Complete each of the following tasks.

In [266]:
!ls data

Comic_Data_Messy.csv  comics_clean.csv


#### Problem 1 [10 points]

1. Load the data into a `pandas` dataframe.
2. Inspect the unique values in the `comic` column.
3. Comment on problems with the column.

In [267]:
import numpy as np  # np.nan is used explicitly in Problem 4
import pandas as pd
from dfply import *
from more_dfply.facets import text_filter

In [268]:
# Your code here
comic_data = pd.read_csv("./data/Comic_Data_Messy.csv")
comic_data.comic.unique() #This is from my first submission, I have used unique() here.

array(['marvel', 'marvl', 'DC comics', 'DC', 'Marvel Comics',
       'marvelcomics', 'dc comics', 'Marvel', 'MV', 'DC Comics'],
      dtype=object)

In [269]:
comic_data.comic.unique  # without parentheses, this displays the bound method instead of calling it

<bound method Series.unique of 0               marvel
1                marvl
2            DC comics
3                   DC
4                marvl
             ...      
23267        DC Comics
23268           marvel
23269           marvel
23270    Marvel Comics
23271           marvel
Name: comic, Length: 23272, dtype: object>

> <font color="orange"> Your thoughts here </font>

> Marvel and DC each appear under multiple names in the column. Marvel appears as `marvl`, `Marvel Comics`, `MV`, and so on; DC appears as `DC Comics`, `DC comics`, `DC`, and so on. We have to clean the column so that each franchise has a single name: `Marvel` or `DC`.
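
As a rough illustration of the problem described above, the sketch below rebuilds the ten spellings from the `unique()` output in a small stand-in frame (the name `sample` is hypothetical) and checks how many distinct values remain after lower-casing and removing spaces. On the real data, `comic_data.comic.value_counts()` would additionally show how often each spelling occurs.

```python
import pandas as pd

# The ten spellings reported by comic_data.comic.unique() above,
# placed in a small stand-in frame (hypothetical name: sample).
variants = ['marvel', 'marvl', 'DC comics', 'DC', 'Marvel Comics',
            'marvelcomics', 'dc comics', 'Marvel', 'MV', 'DC Comics']
sample = pd.DataFrame({"comic": variants})

# Lower-case and strip spaces to see which spellings differ only
# by capitalization or spacing.
normalized = sample.comic.str.lower().str.replace(" ", "", regex=False)

n_raw = sample.comic.nunique()
n_normalized = normalized.nunique()
print(n_raw, n_normalized)
print(sorted(normalized.unique()))
```

Normalizing case and spacing alone still leaves several spellings per publisher (for example `marvl` and `mv`), so an explicit mapping is needed as well.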

#### Problem 2 [10 points]

Clean up the `comic` column.

In [270]:
# Your code here
# View Cell

cleaned = (comic_data
 >> select(X.comic)
 >> filter_by(~text_filter(X.comic, r'\w{6}\s\w{6}', regex=True))
 >> filter_by(~text_filter(X.comic, r'\w{2}\s\w{6}', regex=True))
 >> filter_by(~text_filter(X.comic, r'DC', regex=True))
 >> filter_by(~text_filter(X.comic, r'marvl', regex=True))
 >> filter_by(~text_filter(X.comic, r'MV', regex=True))
 >> filter_by(text_filter(X.comic, r'marvel', regex=True))
 )



In [271]:
cleaned.comic.unique()

array(['marvel', 'marvelcomics', 'Marvel'], dtype=object)

In [272]:
# Transform Cell

# Build a cleaned Comic column with mutate and case_when, then drop the original
cleaned_data = (comic_data
    >> mutate(Comic = case_when((text_filter(X.comic, r'\w{6}\s\w{6}', regex=True), "Marvel"),
                                (text_filter(X.comic, r'\w{2}\s\w{6}', regex=True), "DC"),
                                (text_filter(X.comic, r'marvl', regex=True), "Marvel"),
                                (text_filter(X.comic, r'MV', regex=True), "Marvel"),
                                (text_filter(X.comic, r'marvel', regex=True), "Marvel"),
                                (text_filter(X.comic, r'DC', regex=True), "DC"),
                               )
             )
    >> drop(X.comic)
)



In [273]:
cleaned_data.Comic.unique()

array(['Marvel', 'DC'], dtype=object)
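
For comparison, the same cleanup can be sketched in plain pandas rather than with dfply's `case_when`. This is a simplified alternative, not the approach used in the cell above, and it rests on one assumption: every spelling that contains `dc` (ignoring case) is DC and everything else is Marvel. The frame `sample` is a hypothetical stand-in built from the ten spellings listed earlier.

```python
import numpy as np
import pandas as pd

# Stand-in for comic_data, built from the ten known spellings.
variants = ['marvel', 'marvl', 'DC comics', 'DC', 'Marvel Comics',
            'marvelcomics', 'dc comics', 'Marvel', 'MV', 'DC Comics']
sample = pd.DataFrame({"comic": variants})

# Assumption: any value containing "dc" (case-insensitive) is DC;
# every other value is treated as Marvel.
is_dc = sample.comic.str.lower().str.contains("dc", regex=False)
sample["Comic"] = np.where(is_dc, "DC", "Marvel")

result = sorted(sample.Comic.unique())
n_dc = int((sample.Comic == "DC").sum())
print(result, n_dc)
```

The catch-all branch would silently label any unexpected spelling as Marvel, so this shortcut is only reasonable after the unique values have been inspected, as in Problem 1.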

#### Problem 3 [10 points]

The `PHYSICAL` column contains information about both the character's hair and eyes.  Extract this information into separate columns and clean up as necessary.

In [274]:
# Your code here
cleaned_data_2 = (cleaned_data
 >> mutate(Eyes = X.PHYSICAL.str.extract(r"(\w+) Eye(?:s|balls)"),  # word before "Eyes" or "Eyeballs"
           Hair = X.PHYSICAL.str.extract(r", (.*)").replace(" Hair", "", regex=True))  # text after the comma
 >> mutate(Hair = if_else(X.Hair == "", "Not Available", X.Hair),
           Eyes = if_else(X.Eyes.isna(), "Not Available", X.Eyes))
 >> drop(X.PHYSICAL)
          )

In [275]:
# Just testing some ideas: this could be done more simply, without the optional-word
# regex for hair (found on Stack Overflow) and without any regex for eyes.
# (cleaned_data 
#  >> mutate(Eyes = X.PHYSICAL.str.split(",").str.get(0).str.split(" ").str.get(0), 
#            Hair = X.PHYSICAL.str.split(",").str.get(1).str.replace("Hair", " "))
#  >> mutate(Eyes = if_else(X.Eyes == "", "NA", X.Eyes), 
#            Hair = if_else(X.Hair == " ", "NA", X.Hair) )
#  >> drop(X.PHYSICAL)
# )
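
The extraction in Problem 3 can also be sketched as a single `str.extract` call with named groups. The sample strings below are hypothetical, and the layout `<color> Eyes, <color> Hair` (with either part possibly missing) is an assumption inferred from the regexes above, not confirmed against the real file.

```python
import pandas as pd

# Hypothetical PHYSICAL values; the real column may contain other layouts.
physical = pd.Series([
    "Blue Eyes, Brown Hair",
    "Black Eyeballs, ",
    ", Black Hair",
    "Hazel Eyes, Strawberry Blond Hair",
])

# Both parts are optional; each named group becomes a column.
pattern = (r"^(?:(?P<Eyes>[\w ]+?) Eye(?:s|balls))?"
           r", (?:(?P<Hair>[\w ]+?) Hair)?$")

# Groups that do not participate in a match come back as NaN.
parts = physical.str.extract(pattern).fillna("Not Available")
print(parts)
```

Named groups produce the `Eyes` and `Hair` columns directly, and anchoring the pattern with `^` and `$` makes rows with an unexpected layout come back as missing rather than partially matched.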

#### Problem 4 [5 points]

The `urlslug` column contains the character's name, universe (at the end, in parentheses), and a possible alias/nickname (in the middle of the name, in parentheses). Extract this information into separate columns.

In [284]:
# Your code here
cleaned_data_3 = (cleaned_data_2
 >> mutate(
     Name = X.urlslug.str.extract(r"\\/(\w+)_").replace("_", " ", regex = True).replace(np.nan,"Not Available"),
     Universe = X.urlslug.str.extract(r"_\((\w+[_-]\w+)\)").replace("_", " ", regex = True).replace(np.nan,"Not Available"),
     Nickname = X.urlslug.str.extract(r"_\((\w+)\)_").replace(np.nan, "Not Available")
 )
 >> drop(X.urlslug)
)
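
As a rough check of the slug parsing, the sketch below applies one regex with named groups to three hypothetical slugs. Their exact form (a leading `\/`, underscores for spaces, an optional nickname in parentheses before the universe) is an assumption based on the patterns used above, and the single combined pattern is an alternative to the three separate `extract` calls, not the method used in the cell.

```python
import pandas as pd

# Hypothetical slugs in the assumed layout: \/Name_(Nickname)_(Universe)
slugs = pd.Series([
    r"\/Jonathan_Dillon_(Earth-616)",
    r"\/John_(Mutant)_(Earth-616)",
    r"\/Gene_LaBostrie_(New_Earth)",
])

# Name is matched lazily; the nickname group is optional;
# the final parenthesized part is the universe.
pattern = (r"^\\/(?P<Name>.+?)"
           r"(?:_\((?P<Nickname>[^)]*)\))?"
           r"_\((?P<Universe>[^)]*)\)$")

parts = slugs.str.extract(pattern)
# Replace underscores with spaces, then label missing values.
parts = parts.apply(lambda col: col.str.replace("_", " ", regex=False))
parts = parts.fillna("Not Available")
print(parts)
```

Because the universe is matched by `[^)]*` rather than `\w+[_-]\w+`, this version would also accept universe names without an underscore or hyphen, which may or may not be desirable for the real data.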

In [289]:
cleaned_data_3

| Unnamed: 0 | page_id | ID | ALIGN | SEX | ALIVE | APPEARANCES | FIRST.APPEARANCE | Comic | Eyes | Hair | Name | Universe | Nickname |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666101 | Public Identity | | Male Characters | Living Characters | 4.0 | Apr-97 | Marvel | Blue | Brown | Jonathan Dillon | Earth-616 | Not Available |
| 1 | 280850 | Public Identity | | Male Characters | Deceased Characters | | Oct-01 | Marvel | Blue | Blond | John | Earth-616 | Mutant |
| 2 | 129267 | Public Identity | Good Characters | Male Characters | Living Characters | 15.0 | 1987, September | DC | Not Available | Black | Gene LaBostrie | New Earth | Not Available |
| 3 | 157368 | Public Identity | Good Characters | Male Characters | Deceased Characters | 15.0 | 1992, September | DC | Black | Not Available | Reemuz | New Earth | Not Available |
| 4 | 16171 | Secret Identity | Bad Characters | Male Characters | Living Characters | 1.0 | Jul-73 | Marvel | Not Available | Not Available | Aquon | Earth-616 | Not Available |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23267 | 183949 | | Bad Characters | Female Characters | Living Characters | 2.0 | 2009, October | DC | Not Available | Not Available | Queen of Hearts IV | New Earth | Not Available |
| 23268 | 30345 | Secret Identity | Bad Characters | Male Characters | Living Characters | 3.0 | Dec-93 | Marvel | Not Available | Brown | Regent | Earth-616 | Not Available |
| 23269 | 432532 | Secret Identity | Good Characters | Male Characters | Deceased Characters | 1.0 | Mar-11 | Marvel | Blue | Black | Malcolm Monroe | Earth-616 | Not Available |
| 23270 | 16723 | Secret Identity | Neutral Characters | Female Characters | Living Characters | 2.0 | Oct-97 | Marvel | Not Available | No | Tether | Earth-616 | Not Available |


#### Problem 5 [5 points]

Write your resulting table to a file named `comics_clean.csv` and push your code and this CSV to GitHub.

In [290]:
# Your code here
cleaned_data_3.to_csv("./data/comics_clean.csv")
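
One detail worth checking when writing the file: by default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below demonstrates this on a hypothetical two-row stand-in for `cleaned_data_3`, written to a temporary directory rather than `./data`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for cleaned_data_3.
df = pd.DataFrame({"Name": ["Jonathan Dillon", "John"],
                   "Comic": ["Marvel", "Marvel"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "comics_clean.csv")

    # Default behavior: the index is written as an unnamed first column.
    df.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False writes only the data columns.
    df.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to pass `index=False` depends on whether the row numbers carry meaning; for a default integer index it usually just avoids an extra column on reload.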