# Star Wars Survey

In this project we analyze survey data on the Star Wars movies. FiveThirtyEight is a website that focuses on opinion poll analysis. Before Star Wars: The Force Awakens was released, the team was interested in answering some questions about Star Wars fans. They collected data using the online tool SurveyMonkey, which generated 835 total responses. We'll be analyzing this data throughout the project.

The data includes several columns, among them:

- RespondentID - an anonymized ID for the respondent
- Gender - the respondent's gender
- Household Income - respondent's income
- Education - respondent's education level
- Location (Census Region) - respondent's location 
- Have you seen any of the 6 films in the Star Wars franchise? - Yes/No response
- Do you consider yourself to be a fan of the Star Wars film franchise? - Yes/No response


In [40]:
# import pandas 
import pandas as pd

# some of the data set includes characters that aren't valid in the default utf-8 encoding,
# so we pass encoding='ISO-8859-1' to pd.read_csv

star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
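
To see why the `encoding` argument matters, here is a small self-contained sketch. The file `demo_latin1.csv` and its contents are made up for illustration; it is not the survey file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-line CSV containing the character 'æ', written as ISO-8859-1 bytes.
path = os.path.join(tempfile.mkdtemp(), 'demo_latin1.csv')
with open(path, 'wb') as f:
    f.write('question,answer\nfan?æ,Yes\n'.encode('ISO-8859-1'))

# With the default utf-8 codec the single byte for 'æ' cannot be decoded.
try:
    pd.read_csv(path)
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Declaring the encoding lets pandas decode the same bytes.
demo = pd.read_csv(path, encoding='ISO-8859-1')
print(utf8_ok, demo.loc[0, 'question'])
```

ISO-8859-1 assigns a character to every byte value, so decoding with it never fails, but the text will be garbled if the file was actually written in a different encoding.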

In [41]:
# explore first 10 
star_wars.head(10)

Unnamed: 0,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,Which of the following Star Wars films have you seen? Please select all that apply.,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.,...,Unnamed: 28,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
0,,Response,Response,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,Star Wars: Episode I The Phantom Menace,...,Yoda,Response,Response,Response,Response,Response,Response,Response,Response,Response
1,3292880000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
2,3292880000.0,No,,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
3,3292765000.0,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,1,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central
4,3292763000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
5,3292731000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5,...,Somewhat favorably,Greedo,Yes,No,No,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
6,3292719000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,1,...,Very favorably,Han,Yes,No,Yes,Male,18-29,"$25,000 - $49,999",Bachelor degree,Middle Atlantic
7,3292685000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,6,...,Very favorably,Han,Yes,No,No,Male,18-29,,High school degree,East North Central
8,3292664000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,4,...,Very favorably,Han,No,,Yes,Male,18-29,,High school degree,South Atlantic
9,3292654000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5,...,Somewhat favorably,Han,No,,No,Male,18-29,"$0 - $24,999",Some college or Associate degree,South Atlantic


Looking at the first 10 rows, we can see three things. RespondentID is displayed in scientific notation. The first row (index 0) has no RespondentID and contains only Response or a movie title in each column, so it looks like a second header row rather than a respondent's answers. Finally, several columns contain many NaN values.

In [42]:
# review columns
star_wars.columns

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expan

The column labels show that, apart from the first column of each question, the movie columns are labeled Unnamed: number instead of carrying the movie name. The same is true for the character columns.

In [43]:
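# removing any rows with NaN value for RespondentID
# .copy() gives an independent DataFrame, which avoids a possible
# SettingWithCopyWarning when columns are reassigned later on
star_wars = star_wars[pd.notnull(star_wars['RespondentID'])].copy()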
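
As a minimal sketch of what this filter does, the toy frame below (with made-up IDs, not the survey data) has a header-like first row without an ID. It also shows one possible way to display the IDs without scientific notation, assuming every remaining ID is a whole number.

```python
import numpy as np
import pandas as pd

# Toy data: the first row mimics a sub-header row with no RespondentID.
toy = pd.DataFrame({
    'RespondentID': [np.nan, 3292879998.0, 3292879538.0],
    'Gender': ['Response', 'Male', 'Male'],
})

# Boolean-mask filtering and dropna(subset=...) select the same rows.
by_mask = toy[toy['RespondentID'].notnull()]
by_dropna = toy.dropna(subset=['RespondentID'])
print(by_mask.equals(by_dropna))

# Once no NaN is left, the float IDs can be cast to integers for display.
ids = by_dropna['RespondentID'].astype('int64')
print(ids.tolist())
```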

# Converting Yes/No responses to True/False/NaN

We can make the data easier to analyze by converting Yes and No responses to Boolean values, so we won't need string comparisons later on. We'll use the Series.map() method to perform the conversion and then check the result with value_counts().

In [44]:
# creating a mapping dict
yes_no = {
    'Yes':True,
    'No':False
}

# mapping to the following columns;
# Have you seen any of the 6 films in the Star Wars franchise?
# Do you consider yourself to be a fan of the Star Wars film franchise?

for v in ['Have you seen any of the 6 films in the Star Wars franchise?',
         'Do you consider yourself to be a fan of the Star Wars film franchise?'
         ]:
    star_wars[v] = star_wars[v].map(yes_no)
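
The behavior of `Series.map()` with a dictionary can be sketched on a toy Series (the values below are made up). Any value without a key in the dictionary, including NaN, comes out as NaN, which is why it is worth counting NaN explicitly afterwards.

```python
import numpy as np
import pandas as pd

# Toy answers: two known values, one missing value, one unexpected value.
answers = pd.Series(['Yes', 'No', np.nan, 'Yes', 'Maybe'])
yes_no = {'Yes': True, 'No': False}

mapped = answers.map(yes_no)

# dropna=False keeps NaN in the tally so unmapped answers stay visible.
counts = mapped.value_counts(dropna=False)
print(mapped.tolist())
print(counts)
```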

In [45]:
# value counts for Have you seen any of the 6 films in the Star Wars franchise
# NaN included
star_wars.iloc[:,1].value_counts(dropna=False)

True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

In [46]:
# value counts for Do you consider yourself to be a fan of the Star Wars film franchise?
# NaN included
star_wars.iloc[:,2].value_counts(dropna=False)

True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

# Renaming Unnamed columns 

The data set contains columns labeled "Unnamed". For example, Unnamed: 4 records whether the respondent has seen Episode II Attack of the Clones. If the respondent has seen Episode II, the value is "Star Wars: Episode II Attack of the Clones" instead of Yes or No; otherwise the value is NaN.

We'll therefore map the movie title in each column to True and NaN to False. For the Unnamed: 4 column, if the respondent's value is "Star Wars: Episode II Attack of the Clones", we'll replace it with True, meaning the respondent has seen the movie.

Note that the column "Which of the following Star Wars films have you seen? Please select all that apply." holds the answers for "Star Wars: Episode I The Phantom Menace". In the list below it is referred to as 'Which of the ...'.

We will make the following changes to column names:

- 'Which of the ...' > seen_1
- Unnamed: 4 > seen_2
- Unnamed: 5 > seen_3
- Unnamed: 6 > seen_4
- Unnamed: 7 > seen_5
- Unnamed: 8 > seen_6

For anyone who is not familiar with the films, the list below shows which movie each new column refers to:

- seen_1 = Star Wars: Episode I The Phantom Menace
- seen_2 = Star Wars: Episode II Attack of the Clones
- seen_3 = Star Wars: Episode III Revenge of the Sith
- seen_4 = Star Wars: Episode IV A New Hope
- seen_5 = Star Wars: Episode V The Empire Strikes Back
- seen_6 = Star Wars: Episode VI Return of the Jedi

In [47]:
# import numpy
import numpy as np

# creating a map; note that some titles have a double space after the
# episode number (e.g. 'Episode I  The Phantom Menace') and others do not.
# The keys must match the raw strings exactly, or the value maps to NaN.

movie_mapping = {
    'Star Wars: Episode I  The Phantom Menace': True,
    'Star Wars: Episode II  Attack of the Clones': True,
    'Star Wars: Episode III  Revenge of the Sith': True,
    'Star Wars: Episode IV  A New Hope': True,
    'Star Wars: Episode V The Empire Strikes Back': True,
    'Star Wars: Episode VI Return of the Jedi': True,
    np.nan: False
}

for v in star_wars.columns[3:9]:
    star_wars[v] = star_wars[v].map(movie_mapping)
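
A mapping keyed on exact titles is sensitive to spacing and spelling. The sketch below, on a toy Series with a hypothetical title 'Film A', shows one way to detect values that the mapping missed. It also shows a simpler alternative that only relies on the assumption that any non-missing value means "seen".

```python
import numpy as np
import pandas as pd

# Toy column: the raw title has no trailing space.
titles = pd.Series(['Film A', np.nan, 'Film A'])

# Hypothetical mapping whose key has a stray trailing space.
strict_map = {'Film A ': True}
mapped = titles.map(strict_map)

# Values that were present in the raw data but turned into NaN were missed.
missed = titles.notna() & mapped.isna()
print(missed.tolist())

# Alternative: treat every non-missing entry as True, every NaN as False.
seen = titles.notna()
print(seen.tolist())
```

The `notna()` variant gives up the check that each column holds only the expected title, so it is best suited to columns that have already been inspected.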

In [48]:
# let's rename the columns now to seen_
star_wars = star_wars.rename(columns={
    'Which of the following Star Wars films have you seen? Please select all that apply.' : 'seen_1',
    'Unnamed: 4': 'seen_2',
    'Unnamed: 5': 'seen_3',
    'Unnamed: 6' : 'seen_4',
    'Unnamed: 7': 'seen_5',
    'Unnamed: 8': 'seen_6'
})
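
By default, `DataFrame.rename()` silently ignores keys that do not match any column, so a small typo in a long column name goes unnoticed. The sketch below uses a toy frame with hypothetical column names to show this, and how passing `errors='raise'` turns the mismatch into a `KeyError`.

```python
import pandas as pd

# Toy frame with hypothetical column names and no rows.
toy = pd.DataFrame(columns=['Have you seen it?', 'Unnamed: 4'])

# The first key has an extra space before '?', so only 'Unnamed: 4' is renamed.
silent = toy.rename(columns={'Have you seen it ?': 'seen_1',
                             'Unnamed: 4': 'seen_2'})
print(list(silent.columns))

# With errors='raise', the unmatched key is reported instead of ignored.
try:
    toy.rename(columns={'Have you seen it ?': 'seen_1'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```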

In [49]:
# checking columns
star_wars.head()

Unnamed: 0,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,seen_1,seen_2,seen_3,seen_4,seen_5,seen_6,Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.,...,Unnamed: 28,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
1,3292880000.0,True,True,True,True,True,True,True,True,3.0,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
2,3292880000.0,False,,False,False,False,False,False,False,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
3,3292765000.0,True,False,True,True,True,False,False,False,1.0,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central
4,3292763000.0,True,True,True,True,True,True,True,True,5.0,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
5,3292731000.0,True,True,True,True,True,True,True,True,5.0,...,Somewhat favorably,Greedo,Yes,No,No,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central


# Renaming more Unnamed columns

The next six columns asked respondents to rank the Star Wars movies in order of preference: 1 means the film was their favorite and 6 means it was their least favorite. Each of the following columns can hold a value from 1 to 6 or NaN:

- Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.
- Unnamed: 10
- Unnamed: 11
- Unnamed: 12
- Unnamed: 13
- Unnamed: 14

We'll rename these columns as we did before, this time with the prefix ranking_:

- Please rank the ... > ranking_1
- Unnamed: 10 > ranking_2
- Unnamed: 11 > ranking_3
- Unnamed: 12 > ranking_4
- Unnamed: 13 > ranking_5
- Unnamed: 14 > ranking_6

For reference, each of these columns refers to the following movie:

- Please rank the ... = Star Wars: Episode I The Phantom Menace
- Unnamed: 10 = Star Wars: Episode II Attack of the Clones
- Unnamed: 11 = Star Wars: Episode III Revenge of the Sith
- Unnamed: 12 = Star Wars: Episode IV A New Hope
- Unnamed: 13 = Star Wars: Episode V The Empire Strikes Back
- Unnamed: 14 = Star Wars: Episode VI Return of the Jedi


In [51]:
# renaming columns
star_wars = star_wars.rename(columns={
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'ranking_1',
    'Unnamed: 10': 'ranking_2',
    'Unnamed: 11': 'ranking_3',
    'Unnamed: 12' : 'ranking_4',
    'Unnamed: 13': 'ranking_5',
    'Unnamed: 14': 'ranking_6'
})
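
Typing long column names by hand is error-prone, so an alternative is to build the rename dictionary from column positions. The sketch below does this on a toy frame with made-up values and a shortened, hypothetical first column name. On the survey data the equivalent slice would presumably be `star_wars.columns[9:15]`, which is an assumption based on the column listing above. The cast to float is likewise an assumption that the rankings may have been read in as strings.

```python
import pandas as pd

# Toy frame: rankings stored as strings under unhelpful column names.
toy = pd.DataFrame(
    [['3', '2', '1'], ['1', '3', '2']],
    columns=['Please rank the films', 'Unnamed: 10', 'Unnamed: 11'],
)

# Build {old_name: 'ranking_<n>'} from the column order instead of typing names.
new_names = {old: f'ranking_{i}' for i, old in enumerate(toy.columns, start=1)}
toy = toy.rename(columns=new_names)

# Convert to float so that numeric summaries such as the mean are possible.
toy = toy.astype(float)
means = toy.mean()
print(means)
```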


In [53]:
star_wars.columns

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6',
       'ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consid