# Project Plan

- Write a web scraper that collects the title, description, metrics, etc. for a list of Marvel superheroes from Wikipedia
- Use PyMySQL to write the results to a database
	- Several tables
- Data modeling in either Power BI or SQL itself
- A visualization that looks professional and fun


Table 1 - Character Table
ID | Name | CharacterDesignation | Description | CharacterLink | NumAppearances 

Table 2 - Actor
ID | Actor | Character | Age | NumFilms | YearsInMCU

Table 3 - Films
ID | Title | PremiereDate | Director | Producer | Starring | Cinematography | ProductionCompany | Distributor | Country | Language | Screenplay
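
As a rough sketch, Table 1 could be created with DDL along these lines. The column names `char_name`, `char_type`, `char_desc`, and `appearances` are the ones used by the insert statement in the "Write To Characters Table" section; the `id` column, the column types and lengths, and the helper name `create_characters_table` are assumptions, not a settled schema.

```python
# Hypothetical DDL for the characters table; types and lengths are assumptions.
CREATE_CHARACTERS_SQL = """
CREATE TABLE IF NOT EXISTS characters (
    id INT AUTO_INCREMENT PRIMARY KEY,
    char_name VARCHAR(255) NOT NULL,
    char_type VARCHAR(255),
    char_desc TEXT,
    appearances TEXT
)
"""


def create_characters_table(cur):
    """Run the DDL on an open DB-API cursor (for example, a PyMySQL cursor)."""
    cur.execute(CREATE_CHARACTERS_SQL)
```

`TEXT` is used for the two long fields because the scraped descriptions run to several sentences.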

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import requests
from bs4 import BeautifulSoup
import csv
import pymysql

# Table 1: Character Scraper

In [2]:
# https://en.wikipedia.org/wiki/Characters_of_the_Marvel_Cinematic_Universe

'''
# Unused sketch of a container for one scraped character.
class Character:
    def __init__(self, name, designation, description, link):
        self.name = name
        self.designation = designation
        self.description = description
        self.link = link
'''
def get_major_characters():
    html = requests.get('https://en.wikipedia.org/wiki/Characters_of_the_Marvel_Cinematic_Universe')
    bs = BeautifulSoup(html.text, 'lxml')
    
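    # Character names: the text of each <h3>, minus the trailing "[edit]" link.
    # str.rstrip('[edit]') strips a set of characters rather than a suffix, so it
    # would also eat trailing letters (the final "e" of "Hawkeye", for example).
    names = []
    try:
        headings = bs.find_all('h3')
        names = [re.sub(r'\[edit\]$', '', h.getText()) for h in headings[:-2]
                 if not h.getText().startswith('Introduced')]
    except Exception:
        print('There was an error pulling character names.')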
    
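    # Character descriptions: the first <p> that follows each <h3>.
    descriptions = []
    try:
        for heading in bs.find_all('h3'):
            description = heading.find_next('p')
            if not description.getText().startswith('The depiction of adapted and original characters in the MCU'):
                descriptions.append(description.getText())
    except Exception:
        # Most likely an AttributeError: find_next() returns None when no <p>
        # follows a heading, and the loop stops at that point.
        print('There was an error in pulling descriptions.')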
    
    # Movie appearances: the closest <p> sibling that precedes each <h3>.
    appearances = []
    skipped_prefixes = (
        'The following is a supplementary list of characters that appear in lesser roles',
        'Phase Four',
    )
    try:
        for heading in bs.find_all('h3'):
            appearance = heading.find_previous_sibling('p')
            if not appearance.getText().startswith(skipped_prefixes):
                appearances.append(appearance.getText())
    except Exception:
        # Most likely an AttributeError: no preceding <p> sibling was found.
        print('There was an error in pulling appearances.')

    # The last major character's appearances sit just before the
    # "Minor characters" heading, so they are pulled separately.
    try:
        last_appearance = bs.find('span', id='Minor_characters').parent.find_previous_sibling('p').getText()
        appearances.append(last_appearance)
    except Exception:
        print('Could not pull final appearance.')
        
    # Designations: the text of the closest <h2> that precedes each character heading.
    designations = []
    try:
        for item in bs.find_all('h3'):
            text = item.getText()
            if not text.startswith('Introduced') and text != 'Search':
                designations.append(item.find_previous_sibling('h2').getText())
    except Exception:
        # Most likely an AttributeError: no preceding <h2> sibling was found.
        print('There was an error in pulling designations.')
                
    return names, designations, descriptions, appearances
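

The heading text on the page ends in `[edit]`, and removing it with `str.rstrip` is a common trap because `rstrip` treats its argument as a set of characters. The sketch below compares that with a suffix-only removal; the helper name `strip_edit_suffix` and the sample heading are illustrative assumptions.

```python
import re


def strip_edit_suffix(heading_text):
    """Remove one trailing '[edit]' marker, leaving the rest of the text intact."""
    return re.sub(r'\[edit\]$', '', heading_text)


raw = 'Clint Barton / Hawkeye[edit]'
print(raw.rstrip('[edit]'))    # character-set strip also removes the final "e"
print(strip_edit_suffix(raw))  # suffix-only removal keeps the full name
```

On Python 3.9 or later, `str.removesuffix('[edit]')` does the same job without a regular expression.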


In [3]:
names, designations, descriptions, appearances = get_major_characters()

There was an error in pulling descriptions.
There was an error in pulling appearances.
There was an error in pulling designations.


In [4]:
print(f'There are {len(names)} in names')
print(f'There are {len(descriptions)} in descriptions')
print(f'There are {len(appearances)} in appearances')
print(f'There are {len(designations)} in designations')

There are 134 in names
There are 134 in descriptions
There are 134 in appearances
There are 134 in designations
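
Matching lengths matter here because `zip` stops at the shortest input without raising, so a short list would silently drop characters. A small guard like the one below could make that check explicit; `check_aligned` is a hypothetical helper, and the sample lists are made up for illustration.

```python
def check_aligned(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Illustrative data only; in the notebook the scraped lists would be passed in.
sample_names = ['Bruce Banner / Hulk', 'Bucky Barnes / Winter Soldier / White Wolf']
sample_designations = ['Central characters', 'Central characters']
print(check_aligned(names=sample_names, designations=sample_designations))
```

Equal lengths do not prove the rows line up with one another, so spot-checking a few entries is still worthwhile.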


In [5]:
character_data = list(zip(names, designations, descriptions, appearances))

In [6]:
character_df = pd.DataFrame(character_data, columns = ['Name','Designation','Description','Appearances'])

<class 'list'>


In [26]:
character_data_dict = {name: [designation, description, appearance]
                       for name, designation, description, appearance
                       in zip(names, designations, descriptions, appearances)}

Bruce Banner / Hulk: Central characters[edit]. Dr. Bruce Banner (initially portrayed by Edward Norton and subsequently by Mark Ruffalo)[4] is a founding member of the Avengers and a genius physicist who, because of exposure to gamma radiation, transforms into a green monster—known as the Hulk—when enraged or agitated. When transformed he demonstrates superhuman strength and endurance.[5][6]

Bucky Barnes / Winter Soldier / White Wolf: Central characters[edit]. James Buchanan "Bucky" Barnes (portrayed by Sebastian Stan), also known as the Winter Soldier and White Wolf, is Steve Rogers' childhood best friend and confidant.[10] During World War II, Barnes served as a sergeant in the United States Army and as a member of Rogers' squad of commandos, where he was supposedly killed in action. Captured by and experimented on by Hydra scientists, Barnes was kept in suspended animation, reemerging in the modern world as an enhanced brainwashed assassin, known as the Winter Soldier.[11] In 2016, 
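
The scraped text above still carries Wikipedia residue such as `[edit]` links and numeric citation markers like `[4]`. One possible cleanup step before writing to the database is sketched here; `clean_wiki_text` is a hypothetical helper, and it only handles these two marker styles (lettered note markers would need an extra pattern).

```python
import re

# Matches "[edit]" and purely numeric citation markers such as "[4]" or "[11]".
_MARKER = re.compile(r'\[(?:edit|\d+)\]')


def clean_wiki_text(text):
    """Drop '[edit]' links and numeric citation markers, then trim whitespace."""
    return _MARKER.sub('', text).strip()


sample = 'When transformed he demonstrates superhuman strength and endurance.[5][6]'
print(clean_wiki_text(sample))
print(clean_wiki_text('Central characters[edit]'))
```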

# Write To Characters Table

In [29]:
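import os

# Read the password from an environment variable (MYSQL_PASSWORD is an assumed
# name) rather than hard-coding the credential in the notebook.
conn = pymysql.connect(host='127.0.0.1', user='root',
                       password=os.environ['MYSQL_PASSWORD'],
                       database='mysql', charset='utf8')
cur = conn.cursor()
cur.execute('USE marvel_database')

def store_characters(name, designation, description, appearance):
    # Leave the %s placeholders unquoted: PyMySQL escapes and quotes each
    # parameter itself, so wrapping them in "..." stores stray quote characters
    # and breaks on values that contain a double quote.
    cur.execute('INSERT INTO characters (char_name, char_type, char_desc, appearances) '
                'VALUES (%s, %s, %s, %s)',
                (name, designation, description, appearance))
    cur.connection.commit()

try:
    for key, value in character_data_dict.items():
        store_characters(key, value[0], value[1], value[2])
finally:
    cur.close()
    conn.close()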
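
The loop above commits after every row. A possible alternative, sketched under the assumption that `conn` is an open PyMySQL connection, sends all rows with `executemany` and commits once. `INSERT_SQL` and `store_all_characters` are hypothetical names; the placeholders are left unquoted because the driver quotes each parameter itself.

```python
INSERT_SQL = (
    'INSERT INTO characters (char_name, char_type, char_desc, appearances) '
    'VALUES (%s, %s, %s, %s)'
)


def store_all_characters(conn, character_data_dict):
    """Insert every character in one batch and commit once.

    Expects a mapping of name -> [designation, description, appearances].
    """
    rows = [(name, *fields[:3]) for name, fields in character_data_dict.items()]
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, rows)
    conn.commit()
    return len(rows)
```

A single commit also means a failure partway through leaves the table unchanged instead of half-filled, provided the table uses a transactional engine such as InnoDB.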

Fields I might want to add to this table:
- Number of appearances (calculated from the appearances column)
- Country
- Debut year

I might also want to go back and scrape the minor characters.