## Star Wars Fan Survey


The project was created to answer the question:

**Does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?**

To answer it, the dataset contains survey responses from Star Wars fans. 835 surveys were collected.

In [1]:
import pandas as pd
import numpy as np

In [2]:
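#The plain read is left commented out; the file is read with an explicit ISO-8859-1 encoding instead
#star_wars = pd.read_csv("star_wars.csv")
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")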

In [3]:
#Column names
print("Number of columns: ", len(star_wars.columns))
print("\nColumn names:\n", star_wars.columns)

Number of columns:  38

Column names:
 Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',


In [4]:
print(star_wars.shape)
star_wars.head(3)

(1187, 38)


|  | RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 |  | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
| 1 | 3292880000.0 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 |  | High school degree | South Atlantic |
| 2 | 3292880000.0 | No |  |  |  |  |  |  |  |  | ... |  |  |  |  | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |


### Modifying RespondentID

* The dataset needs to be cleaned so that it is easier to work with column by column.

The **RespondentID** column holds a unique ID for each respondent. Some rows have a blank RespondentID, so they need to be removed.

In [5]:
#NaN values in RespondentID
star_wars["RespondentID"].isnull().value_counts()

False    1186
True        1
Name: RespondentID, dtype: int64

* 1 row needs to be removed

In [6]:
#Removing rows where "RespondentID" has NaN value
print("Number of rows before removing NaN 'RespondentID' values: ", star_wars.shape[0])
star_wars = star_wars[star_wars["RespondentID"].notnull()]
print("Number of rows after removing NaN 'RespondentID' values: ", star_wars.shape[0])

Number of rows before removing NaN 'RespondentID' values:  1187
Number of rows after removing NaN 'RespondentID' values:  1186
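
As a side note, the same kind of filtering can also be written with `dropna`. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not the survey data) and keeps only the rows whose `RespondentID` is present.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey (values are made up)
toy = pd.DataFrame({
    "RespondentID": [np.nan, 1.0, 2.0],
    "Gender": ["Response", "Male", "Female"],
})

# Drop rows where RespondentID is missing; other columns are not checked
cleaned = toy.dropna(subset=["RespondentID"])
print(cleaned.shape)
```

Restricting `subset` to the ID column matters here, because many other survey columns legitimately contain NaN and those rows should be kept.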


### Modifying Yes/No questions

Two columns ask the following questions:

1. Have you seen any of the 6 films in the Star Wars franchise?
2. Do you consider yourself to be a fan of the Star Wars film franchise?

Both columns contain the answers Yes, No, or NaN.

It is more convenient and useful to convert these string values into Boolean (True/False/NaN) values.
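
One possible way to do this conversion is `Series.map` with a dictionary. The sketch below runs on a small made-up Series rather than the survey itself; the names `answers` and `yes_no` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy answers standing in for one survey column (not the real data)
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# Hypothetical name for the lookup table from survey strings to booleans
yes_no = {"Yes": True, "No": False}

# Series.map looks each value up in the dict; values with no match,
# including NaN, come back as NaN
converted = answers.map(yes_no)
print(converted.tolist())
```

Because unmatched values stay NaN, missing answers remain distinguishable from an explicit "No".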

In [7]:
#question 1
star_wars["Have you seen any of the 6 films in the Star Wars franchise?"].value_counts(dropna=False)

Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

* Question 1 has no null values

In [8]:
#question 2
star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"].value_counts(dropna=False)

Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

In [9]:
#Map the "Yes"/"No" strings to booleans; missing answers stay NaN
yes_no = {"Yes": True, "No": False}
seen_col = "Have you seen any of the 6 films in the Star Wars franchise?"
star_wars[seen_col] = star_wars[seen_col].map(yes_no)

In [10]:
star_wars["Have you seen any of the 6 films in the Star Wars franchise?"].value_counts()

True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64