# Star Wars Survey

This notebook follows the Dataquest guided project on the Star Wars survey.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from pathlib import Path

In [2]:
data_path = Path.home() / "datasets" / "tabular_practice"

star_wars = pd.read_csv(data_path / "star_wars.csv", encoding="ISO-8859-1")
star_wars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 38 columns):
 #   Column                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                         --------------  -----  
 0   RespondentID                                                                                                                                   1186 non-null   int64  
 1   Have you seen any of the 6 films in the Star Wars franchise?                                                                                   1186 non-null   object 
 2   Do you consider yourself to be a fan of the Star Wars film franchise?                                                                          836 non-null    object 
 3   Which of the following Star 

In [3]:
star_wars.head()

Unnamed: 0,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,Which of the following Star Wars films have you seen? Please select all that apply.,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.,...,Unnamed: 28,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
0,3292879998,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3.0,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
1,3292879538,No,,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
2,3292765271,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,1.0,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central
3,3292763116,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central
4,3292731220,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5.0,...,Somewhat favorably,Greedo,Yes,No,No,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central


In [4]:
list(enumerate(star_wars.columns))

[(0, 'RespondentID'),
 (1, 'Have you seen any of the 6 films in the Star Wars franchise?'),
 (2, 'Do you consider yourself to be a fan of the Star Wars film franchise?'),
 (3,
  'Which of the following Star Wars films have you seen? Please select all that apply.'),
 (4, 'Unnamed: 4'),
 (5, 'Unnamed: 5'),
 (6, 'Unnamed: 6'),
 (7, 'Unnamed: 7'),
 (8, 'Unnamed: 8'),
 (9,
  'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.'),
 (10, 'Unnamed: 10'),
 (11, 'Unnamed: 11'),
 (12, 'Unnamed: 12'),
 (13, 'Unnamed: 13'),
 (14, 'Unnamed: 14'),
 (15,
  'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.'),
 (16, 'Unnamed: 16'),
 (17, 'Unnamed: 17'),
 (18, 'Unnamed: 18'),
 (19, 'Unnamed: 19'),
 (20, 'Unnamed: 20'),
 (21, 'Unnamed: 21'),
 (22, 'Unnamed: 22'),
 (23, 'Unnamed: 23'),
 (24, 'Unnamed: 24'),
 (25, 'Unnamed: 25'),
 (26, 'Unnamed: 26'),
 (27, 

In [5]:
for i in range(3, 9):
    print(f"[{i}]:")
    print(star_wars.iloc[:, i].value_counts(dropna=False))

[3]:
Which of the following Star Wars films have you seen? Please select all that apply.
Star Wars: Episode I  The Phantom Menace    673
NaN                                         513
Name: count, dtype: int64
[4]:
Unnamed: 4
NaN                                            615
Star Wars: Episode II  Attack of the Clones    571
Name: count, dtype: int64
[5]:
Unnamed: 5
NaN                                            636
Star Wars: Episode III  Revenge of the Sith    550
Name: count, dtype: int64
[6]:
Unnamed: 6
Star Wars: Episode IV  A New Hope    607
NaN                                  579
Name: count, dtype: int64
[7]:
Unnamed: 7
Star Wars: Episode V The Empire Strikes Back    758
NaN                                             428
Name: count, dtype: int64
[8]:
Unnamed: 8
Star Wars: Episode VI Return of the Jedi    738
NaN                                         448
Name: count, dtype: int64


In [6]:
new_columns = [
    "respondent_id",
    "seen_any_of_6_films",
    "fan_of_franchise",
    "seen_episode1",
    "seen_episode2",
    "seen_episode3",
    "seen_episode4",
    "seen_episode5",
    "seen_episode6",
]

film_titles = pd.Series(
    [
        "Star Wars: Episode I. The Phantom Menace",
        "Star Wars: Episode II. Attack of the Clones",
        "Star Wars: Episode III. Revenge of the Sith",
        "Star Wars: Episode IV. A New Hope",
        "Star Wars: Episode V. The Empire Strikes Back",
        "Star Wars: Episode VI. Return of the Jedi",
    ]
)

star_wars.iloc[:, 1] = star_wars.iloc[:, 1] == "Yes"

In [7]:
nz_ind = star_wars.iloc[:, 2].notnull()
name = star_wars.columns[2]
star_wars.loc[nz_ind, name] = star_wars.loc[nz_ind, name] == "Yes"
star_wars.iloc[:, :3].describe(include="all")

Unnamed: 0,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?
count,1186.0,1186,836
unique,,2,2
top,,True,True
freq,,936,552
mean,3290128000.0,,
std,1055639.0,,
min,3288373000.0,,
25%,3289451000.0,,
50%,3290147000.0,,
75%,3290814000.0,,


In [8]:
star_wars.iloc[:, 3:9] = star_wars.iloc[:, 3:9].notnull()
star_wars.iloc[:, :9].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 9 columns):
 #   Column                                                                               Non-Null Count  Dtype 
---  ------                                                                               --------------  ----- 
 0   RespondentID                                                                         1186 non-null   int64 
 1   Have you seen any of the 6 films in the Star Wars franchise?                         1186 non-null   object
 2   Do you consider yourself to be a fan of the Star Wars film franchise?                836 non-null    object
 3   Which of the following Star Wars films have you seen? Please select all that apply.  1186 non-null   object
 4   Unnamed: 4                                                                           1186 non-null   object
 5   Unnamed: 5                                                                           1186 non-null

In [9]:
for i in range(9, 15):
    print(f"[{i}]:")
    print(star_wars.iloc[:, i].value_counts(dropna=False))

[9]:
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.
NaN    351
4.0    237
6.0    168
3.0    130
1.0    129
5.0    100
2.0     71
Name: count, dtype: int64
[10]:
Unnamed: 10
NaN    350
5.0    300
4.0    183
2.0    116
3.0    103
6.0    102
1.0     32
Name: count, dtype: int64
[11]:
Unnamed: 11
NaN    351
6.0    217
5.0    203
4.0    182
3.0    150
2.0     47
1.0     36
Name: count, dtype: int64
[12]:
Unnamed: 12
NaN    350
1.0    204
6.0    161
2.0    135
4.0    130
3.0    127
5.0     79
Name: count, dtype: int64
[13]:
Unnamed: 13
NaN    350
1.0    289
2.0    235
5.0    118
3.0    106
4.0     47
6.0     41
Name: count, dtype: int64
[14]:
Unnamed: 14
NaN    350
2.0    232
3.0    220
1.0    146
6.0    145
4.0     57
5.0     36
Name: count, dtype: int64


In [10]:
new_columns += [f"rank_episode{i}" for i in range(1, 7)]

In [11]:
questions_survey = pd.Series(
    [
        "Have you seen any of the 6 films in the Star Wars franchise?",
        "Do you consider yourself to be a fan of the Star Wars film franchise?",
        "Which of the following Star Wars films have you seen? Please select all that apply.",
        "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.",
        "Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.",
        "Which character shot first?",
        "Are you familiar with the Expanded Universe?",
        "Do you consider yourself to be a fan of the Expanded Universe?",
        "Do you consider yourself to be a fan of the Star Trek franchise?",
    ]
)

In [12]:
map_vals = {
    "Very favorably": "favorable",
    "Somewhat favorably": "favorable",
    "Neither favorably nor unfavorably (neutral)": "neutral",
    "Somewhat unfavorably": "unfavorable",
    "Very unfavorably": "unfavorable",
    "Unfamiliar (N/A)": "unfamiliar",
}

favorable_hist = []
for i in range(15, 29):
    nz_ind = star_wars.iloc[:, i].notnull()
    name = star_wars.columns[i]
    mapped = star_wars.loc[nz_ind, name].map(map_vals)
    hist = mapped.value_counts(normalize=True)
    favorable_hist.append((i, hist["favorable"], hist))

for i, _, hist in sorted(favorable_hist, key=lambda x: x[1], reverse=True):
    print(f"[{i}]:")
    print(hist)

[16]:
Unnamed: 16
favorable      0.927798
neutral        0.045728
unfavorable    0.019254
unfamiliar     0.007220
Name: proportion, dtype: float64
[15]:
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.
favorable      0.917973
neutral        0.053076
unfamiliar     0.018094
unfavorable    0.010856
Name: proportion, dtype: float64
[17]:
Unnamed: 17
favorable      0.910951
neutral        0.057762
unfavorable    0.021661
unfamiliar     0.009627
Name: proportion, dtype: float64
[19]:
Unnamed: 19
favorable      0.909091
neutral        0.052121
unfamiliar     0.020606
unfavorable    0.018182
Name: proportion, dtype: float64
[28]:
Unnamed: 28
favorable      0.906780
neutral        0.061743
unfavorable    0.019370
unfamiliar     0.012107
Name: proportion, dtype: float64
[25]:
Unnamed: 25
favorable      0.900000
neutral        0.068675
unfavorable    0.019277
unfamiliar     0.012048
Name: proportion, dtype: float64
[24]:
Unnamed: 24
f

In [13]:
# We match the distributions to the plots in
#   https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/
# in order to figure out the character names.

characters = [
    (16, "Luke Skywalker"),
    (15, "Han Solo"),
    (17, "Princess Leia Organa"),
    (19, "Obi Wan Kenobi"),
    (28, "Yoda"),
    (25, "R2-D2"),
    (24, "C-3PO"),
    (18, "Anakin Skywalker"),
    (21, "Darth Vader"),
    (22, "Lando Calrissian"),
    (27, "Padme Amidala"),
    (23, "Boba Fett"),
    (20, "Emperor Palpatine"),
    (26, "Jar Jar Binks"),
]

names = [
    "view_" + x[1].lower().replace(" ", "_").replace("-", "_")
    for x in sorted(characters, key=lambda y: y[0])
]
new_columns += names
new_columns

['respondent_id',
 'seen_any_of_6_films',
 'fan_of_franchise',
 'seen_episode1',
 'seen_episode2',
 'seen_episode3',
 'seen_episode4',
 'seen_episode5',
 'seen_episode6',
 'rank_episode1',
 'rank_episode2',
 'rank_episode3',
 'rank_episode4',
 'rank_episode5',
 'rank_episode6',
 'view_han_solo',
 'view_luke_skywalker',
 'view_princess_leia_organa',
 'view_anakin_skywalker',
 'view_obi_wan_kenobi',
 'view_emperor_palpatine',
 'view_darth_vader',
 'view_lando_calrissian',
 'view_boba_fett',
 'view_c_3po',
 'view_r2_d2',
 'view_jar_jar_binks',
 'view_padme_amidala',
 'view_yoda']

In [14]:
new_columns += ["who_shot_first"]

# Named yes_no rather than slice to avoid shadowing the Python built-in.
yes_no = star_wars.iloc[:, 30:33].copy()
nz_ind = yes_no.notnull()
yes_no[nz_ind] = yes_no[nz_ind] == "Yes"
yes_no

Unnamed: 0,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?,Do you consider yourself to be a fan of the Star Trek franchise?
0,True,False,False
1,,,True
2,False,,False
3,False,,True
4,True,False,False
...,...,...,...
1181,False,,True
1182,False,,True
1183,,,False
1184,False,,True


In [15]:
star_wars.iloc[:, 30:33] = yes_no
new_columns += ["familiar_expanded_universe", "fan_of_expanded_universe", "fan_of_star_trek_franchise"]

In [16]:
new_columns += list(star_wars.columns[33:37].str.lower().str.replace(" ", "_"))
new_columns

['respondent_id',
 'seen_any_of_6_films',
 'fan_of_franchise',
 'seen_episode1',
 'seen_episode2',
 'seen_episode3',
 'seen_episode4',
 'seen_episode5',
 'seen_episode6',
 'rank_episode1',
 'rank_episode2',
 'rank_episode3',
 'rank_episode4',
 'rank_episode5',
 'rank_episode6',
 'view_han_solo',
 'view_luke_skywalker',
 'view_princess_leia_organa',
 'view_anakin_skywalker',
 'view_obi_wan_kenobi',
 'view_emperor_palpatine',
 'view_darth_vader',
 'view_lando_calrissian',
 'view_boba_fett',
 'view_c_3po',
 'view_r2_d2',
 'view_jar_jar_binks',
 'view_padme_amidala',
 'view_yoda',
 'who_shot_first',
 'familiar_expanded_universe',
 'fan_of_expanded_universe',
 'fan_of_star_trek_franchise',
 'gender',
 'age',
 'household_income',
 'education']

In [21]:
star_wars["Education"].value_counts(dropna=False)

Education
Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
NaN                                 150
High school degree                  105
Less than high school degree          7
Name: count, dtype: int64

In [23]:
star_wars.iloc[:, 37].value_counts(dropna=False)

Location (Census Region)
East North Central    181
Pacific               175
South Atlantic        170
NaN                   143
Middle Atlantic       122
West South Central    110
West North Central     93
Mountain               79
New England            75
East South Central     38
Name: count, dtype: int64

In [24]:
new_columns += ["location"]

star_wars.columns = new_columns
star_wars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 38 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   respondent_id               1186 non-null   int64  
 1   seen_any_of_6_films         1186 non-null   object 
 2   fan_of_franchise            836 non-null    object 
 3   seen_episode1               1186 non-null   object 
 4   seen_episode2               1186 non-null   object 
 5   seen_episode3               1186 non-null   object 
 6   seen_episode4               1186 non-null   object 
 7   seen_episode5               1186 non-null   object 
 8   seen_episode6               1186 non-null   object 
 9   rank_episode1               835 non-null    float64
 10  rank_episode2               836 non-null    float64
 11  rank_episode3               835 non-null    float64
 12  rank_episode4               836 non-null    float64
 13  rank_episode5               836 n

In [25]:
# index=False avoids writing the RangeIndex as an extra unnamed column.
star_wars.to_csv(data_path / "star_wars_cleaned.csv", index=False)

Next, we could try to understand the pattern of missing values.
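
One possible starting point, sketched on a small made-up frame rather than the survey itself: count the missing entries per column, then cross-tabulate the screening answer against whether a follow-up question was left blank. The column names mirror the cleaned ones, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned survey; the values are invented.
toy = pd.DataFrame(
    {
        "seen_any_of_6_films": [True, False, True, True, False],
        "fan_of_franchise": [True, np.nan, False, np.nan, np.nan],
        "rank_episode1": [3.0, np.nan, 1.0, np.nan, np.nan],
    }
)

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()

# Readable labels instead of raw booleans for the cross-tabulation.
seen = toy["seen_any_of_6_films"].map({True: "seen", False: "not seen"})
fan_missing = toy["fan_of_franchise"].isnull().map(
    {True: "missing", False: "answered"}
)

# Rows: screening answer. Columns: whether the follow-up was answered.
missing_by_seen = pd.crosstab(seen, fan_missing)

print(missing_per_column)
print(missing_by_seen)
```

If the follow-up questions are blank mostly for respondents who answered "No" to the screening question, the gaps are likely structural (the question was never asked) rather than refusals.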

In [26]:
star_wars[["fan_of_franchise", "fan_of_star_trek_franchise"]]

Unnamed: 0,fan_of_franchise,fan_of_star_trek_franchise
0,True,False
1,,True
2,False,False
3,True,True
4,True,False
...,...,...
1181,True,True
1182,True,True
1183,,False
1184,True,True


In [27]:
seen_names = [x for x in star_wars.columns if x.startswith("seen_ep")]
seen_any = star_wars.loc[:, seen_names].any(axis=1)
(star_wars["seen_any_of_6_films"] == seen_any).sum()

1085

In [28]:
wrong_ind = (star_wars["seen_any_of_6_films"] != seen_any)
star_wars.loc[wrong_ind, seen_names + ["seen_any_of_6_films"]]

Unnamed: 0,seen_episode1,seen_episode2,seen_episode3,seen_episode4,seen_episode5,seen_episode6,seen_any_of_6_films
10,False,False,False,False,False,False,True
80,False,False,False,False,False,False,True
96,False,False,False,False,False,False,True
105,False,False,False,False,False,False,True
127,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...
1084,False,False,False,False,False,False,True
1108,False,False,False,False,False,False,True
1139,False,False,False,False,False,False,True
1141,False,False,False,False,False,False,True


In [29]:
star_wars.loc[wrong_ind, seen_names + ["seen_any_of_6_films"]].value_counts()

seen_episode1  seen_episode2  seen_episode3  seen_episode4  seen_episode5  seen_episode6  seen_any_of_6_films
False          False          False          False          False          False          True                   101
Name: count, dtype: int64

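For 101 rows, `seen_any_of_6_films` is `True` but every `seen_episode*` column is `False`: these respondents reported having seen at least one film without selecting any individual episode.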
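
A minimal sketch of one way to keep track of such rows, assuming we prefer to flag them rather than overwrite either answer. It runs on invented toy data, and the column name `seen_inconsistent` is hypothetical, not part of the original notebook.

```python
import pandas as pd

# Toy data; row 1 says "seen some film" but ticks no episode.
toy = pd.DataFrame(
    {
        "seen_any_of_6_films": [True, True, False, True],
        "seen_episode1": [True, False, False, False],
        "seen_episode2": [False, False, False, True],
    }
)

seen_cols = [c for c in toy.columns if c.startswith("seen_ep")]

# True where at least one individual episode was ticked.
ticked_any = toy[seen_cols].any(axis=1)

# Hypothetical flag column: overall "Yes" without any ticked episode.
toy["seen_inconsistent"] = toy["seen_any_of_6_films"] & ~ticked_any

print(toy["seen_inconsistent"].sum())
```

Keeping the flag makes it easy to rerun later summaries with and without the inconsistent respondents and see whether the conclusions change.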

In [31]:
seen_per_age = star_wars.pivot_table(seen_names, "age", margins=True)
seen_per_age

age,seen_episode1,seen_episode2,seen_episode3,seen_episode4,seen_episode5,seen_episode6
18-29,0.733945,0.678899,0.665138,0.697248,0.733945,0.733945
30-44,0.652985,0.589552,0.567164,0.656716,0.735075,0.735075
45-60,0.621993,0.508591,0.487973,0.56701,0.756014,0.721649
> 60,0.531599,0.394052,0.371747,0.386617,0.624535,0.587361
All,0.630019,0.535373,0.515296,0.570746,0.712237,0.693117


In [35]:
film_titles

0         Star Wars: Episode I. The Phantom Menace
1      Star Wars: Episode II. Attack of the Clones
2      Star Wars: Episode III. Revenge of the Sith
3                Star Wars: Episode IV. A New Hope
4    Star Wars: Episode V. The Empire Strikes Back
5        Star Wars: Episode VI. Return of the Jedi
dtype: object

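Overall, Episode V is the most widely seen film (about 71% of respondents who reported an age), followed by Episode VI (about 69%). Younger respondents are more likely to have seen Episodes I to IV: for example, about 70% of the 18-29 group have seen Episode IV, against about 39% of those over 60.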
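
The shares in the table are group means of boolean columns: the mean of a `True`/`False` column is the fraction of `True` values. A small sketch on invented data shows the mechanics; the explicit cast to float is a precaution added here and is not taken from the original cells.

```python
import pandas as pd

# Toy data with two age groups; the values are invented.
toy = pd.DataFrame(
    {
        "age": ["18-29", "18-29", "> 60", "> 60"],
        "seen_episode1": [True, True, True, False],
        "seen_episode5": [True, False, True, True],
    }
)

seen_cols = ["seen_episode1", "seen_episode5"]

# Cast booleans to floats so the mean is unambiguous (True -> 1.0, False -> 0.0).
numeric = toy.assign(**{c: toy[c].astype(float) for c in seen_cols})

# Mean per age group = share of respondents in that group who saw the film.
share_seen = numeric.pivot_table(values=seen_cols, index="age", aggfunc="mean")

print(share_seen)
```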

In [36]:
seen_per_education = star_wars.pivot_table(seen_names, "education", margins=True)
seen_per_education

education,seen_episode1,seen_episode2,seen_episode3,seen_episode4,seen_episode5,seen_episode6
Bachelor degree,0.641745,0.529595,0.507788,0.607477,0.757009,0.728972
Graduate degree,0.650909,0.541818,0.505455,0.592727,0.752727,0.730909
High school degree,0.542857,0.457143,0.457143,0.504762,0.580952,0.571429
Less than high school degree,0.428571,0.428571,0.428571,0.428571,0.428571,0.428571
Some college or Associate degree,0.643293,0.567073,0.557927,0.54878,0.692073,0.679878
All,0.633205,0.53668,0.517375,0.573359,0.715251,0.695946
