# What is the most popular Star Wars Movie?

## Introduction

This notebook analyzes the Star Wars survey to find out which Star Wars movie is the most popular.

## Data Collection

The survey's authors polled Star Wars fans using the online tool SurveyMonkey, received 835 total responses, and uploaded the data to their GitHub repository.

When we read the data, we need to specify an encoding, because the dataset contains some characters that cannot be decoded with Python's default UTF-8 encoding.
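To illustrate why the encoding matters, here is a minimal sketch that uses a made-up byte string rather than the survey file itself. It assumes the troublesome characters are single bytes such as `0xE6`, which is "æ" in ISO-8859-1 but an incomplete sequence in UTF-8.

```python
# Hypothetical sample: a column label that ends in the single byte 0xE6.
raw = b"Expanded Universe?\xe6"

# Decoding as UTF-8 fails because 0xE6 starts a multi-byte sequence
# that is never completed.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every byte to a character, so decoding always succeeds.
latin1_text = raw.decode("ISO-8859-1")

print(utf8_ok)
print(latin1_text)
```

Passing `encoding='ISO-8859-1'` to `pd.read_csv` applies the same decoding to the whole file.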

## Data Exploration

### Initial Observations

Before diving into detailed analysis, we will first explore the **general information** of the dataset to understand its **structure** and identify any initial issues or patterns. This will help guide our **data cleaning** and **preparation** steps.

In [160]:
import seaborn as sns
import pandas as pd 
import numpy as np
import sys
import os

In [161]:
# load the dataset
df = pd.read_csv('starWars.csv', encoding='ISO-8859-1') # ISO-8859-1 encoding to handle special characters

In [162]:
# print the general information of the dataset
print('*' * 100)
print('The information of the dataset is as follows:')
df.info()  # info() prints its report directly and returns None, so print() is not needed

****************************************************************************************************
The information of the dataset is as follows:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1187 entries, 0 to 1186
Data columns (total 38 columns):
 #   Column                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                         --------------  -----  
 0   RespondentID                                                                                                                                   1186 non-null   float64
 1   Have you seen any of the 6 films in the Star Wars franchise?                                                                                   1187 non-null   object 
 2   Do you consider yourself to be a fan of the Star Wars

At first glance, we notice the following about the dataset:
- There are **38 features** — we need to simplify them and identify the most important ones.
- Some features are **unnamed** — we should determine what these columns represent.
- Except for `RespondentID` (a float), all other columns are stored as **strings** (`object` dtype), so the format is consistent.
- Several columns contain **many null values**, indicating that significant data preprocessing will be needed.

Overall, this dataset will require substantial cleaning.  
Our plan is to simplify the dataset for easier analysis:
1. **Convert "yes" or "no" answers to 1 and 0**.
2. Investigate unnamed columns, as some may be **one-hot encoded**.
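The first step of the plan could look roughly like the sketch below. It runs on a small toy column rather than the survey data, and the column name `seen_any` is a placeholder chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one yes/no survey column (not the real data).
toy = pd.DataFrame({"seen_any": ["Yes", "No", np.nan, "Yes"]})

# Map the answers to 1 and 0; values missing from the mapping stay NaN.
yes_no = {"Yes": 1, "No": 0}
toy["seen_any"] = toy["seen_any"].map(yes_no)

print(toy["seen_any"].tolist())
```

Because missing answers are kept as NaN, the resulting column holds floats rather than integers. Whether to keep or fill those missing values is a separate cleaning decision.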

### Samples

From our initial observations, it’s clear that the dataset has a **complex structure** and will require **cleaning** and **restructuring**.

After gaining a general understanding of the dataset’s structure, we can now look at some sample entries.

In [163]:
# see some examples of the dataset
df.head()


Unnamed: 0,RespondentID,Have you seen any of the 6 films in the Star Wars franchise?,Do you consider yourself to be a fan of the Star Wars film franchise?,Which of the following Star Wars films have you seen? Please select all that apply.,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.,...,Unnamed: 28,Which character shot first?,Are you familiar with the Expanded Universe?,Do you consider yourself to be a fan of the Expanded Universe?æ,Do you consider yourself to be a fan of the Star Trek franchise?,Gender,Age,Household Income,Education,Location (Census Region)
0,,Response,Response,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,Star Wars: Episode I The Phantom Menace,...,Yoda,Response,Response,Response,Response,Response,Response,Response,Response,Response
1,3292880000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,3,...,Very favorably,I don't understand this question,Yes,No,No,Male,18-29,,High school degree,South Atlantic
2,3292880000.0,No,,,,,,,,,...,,,,,Yes,Male,18-29,"$0 - $24,999",Bachelor degree,West South Central
3,3292765000.0,Yes,No,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,,,,1,...,Unfamiliar (N/A),I don't understand this question,No,,No,Male,18-29,"$0 - $24,999",High school degree,West North Central
4,3292763000.0,Yes,Yes,Star Wars: Episode I The Phantom Menace,Star Wars: Episode II Attack of the Clones,Star Wars: Episode III Revenge of the Sith,Star Wars: Episode IV A New Hope,Star Wars: Episode V The Empire Strikes Back,Star Wars: Episode VI Return of the Jedi,5,...,Very favorably,I don't understand this question,No,,Yes,Male,18-29,"$100,000 - $149,999",Some college or Associate degree,West North Central


From the samples above, it is clear that this dataset requires significant cleaning. **It contains a mix of multiple-choice answers, yes-or-no responses, and one-hot encoded columns.**
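One possible way to handle the layout seen in the samples is sketched below: row 0 carries sub-labels (film titles, "Response") rather than answers, so it can be used to rename the unnamed columns before it is dropped. The toy frame, the shortened question text, and the `seen:` prefix are all assumptions made for illustration, not the survey's real columns.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout above: row 0 holds sub-labels, not answers.
toy = pd.DataFrame({
    "RespondentID": [np.nan, 1.0, 2.0],
    "Which films have you seen?": ["Episode I", "Episode I", np.nan],
    "Unnamed: 2": ["Episode II", np.nan, "Episode II"],
})

# Use the sub-label row to build readable names for the film columns.
labels = toy.iloc[0]
film_cols = ["Which films have you seen?", "Unnamed: 2"]
renamed = {col: f"seen: {labels[col]}" for col in film_cols}

# Drop the label row, rename, and turn "title present" into True/False.
clean = toy.iloc[1:].rename(columns=renamed).reset_index(drop=True)
seen_cols = list(renamed.values())
clean[seen_cols] = clean[seen_cols].notna()

print(clean)
```

Here a film title in a cell is read as "seen" and a missing value as "not seen", which matches how the sample rows appear but should be checked against the full data.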

### Unique Values

From the sample rows, we can see that some columns are simple **yes-or-no questions**, some are **one-hot encoded** columns, and others are **multiple choice**. To simplify the dataset, we will ignore the one-hot encoded columns and **focus on exploring the unique values** in each of the remaining columns.

In [164]:
# display the unique answers of each column
print('*' * 100)

# categorize the columns: questions whose answers are spread across several columns
one_hot_encoding_columns = ["Which of the following Star Wars films have you seen? Please select all that apply.",
                           "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.",
                           "Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her."]

# loop through all the columns and print the unique values
for col in df.columns:
    if not col.startswith('Unnamed') and col not in one_hot_encoding_columns:
        print(f"Unique values in '{col}':")
        print(df[col].unique())
        print('*' * 100)

****************************************************************************************************
Unique values in 'RespondentID':
[           nan 3.29288000e+09 3.29287954e+09 ... 3.28837529e+09
 3.28837307e+09 3.28837292e+09]
****************************************************************************************************
Unique values in 'Have you seen any of the 6 films in the Star Wars franchise?':
['Response' 'Yes' 'No']
****************************************************************************************************
Unique values in 'Do you consider yourself to be a fan of the Star Wars film franchise?':
['Response' 'Yes' nan 'No']
****************************************************************************************************
Unique values in 'Which character shot first?':
['Response' "I don't understand this question" nan 'Greedo' 'Han']
****************************************************************************************************
Unique values in 'Are you f

**From the unique values, we can see that there are basically two types of columns: "yes or no" and "multiple choice".**

---

**Yes or no columns:**
- Have you seen any of the 6 films in the Star Wars franchise?
- Do you consider yourself to be a fan of the Star Wars film franchise?
- Are you familiar with the Expanded Universe?
- Do you consider yourself to be a fan of the Expanded Universe?
- Do you consider yourself to be a fan of the Star Trek franchise?

**Multiple choice columns:**
- Gender: the respondent's gender
- Age: the respondent's age group
- Household Income: the respondent's income bracket
- Education: the respondent's education level
- Location (Census Region): the respondent's census region
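The same split into yes-or-no and multiple-choice columns can also be derived programmatically. The sketch below assumes the label row has already been removed and uses a toy frame with shortened, made-up column names; `is_yes_no` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

# Toy stand-in with the label row already removed (not the real data).
toy = pd.DataFrame({
    "Seen any film?": ["Yes", "No", "Yes"],
    "Fan of the franchise?": ["Yes", np.nan, "No"],
    "Age": ["18-29", "30-44", "18-29"],
})

def is_yes_no(series):
    """Return True when every non-null answer is 'Yes' or 'No'."""
    answers = set(series.dropna().unique())
    return len(answers) > 0 and answers <= {"Yes", "No"}

# Columns that pass the check are yes/no; the rest are multiple choice.
yes_no_cols = [col for col in toy.columns if is_yes_no(toy[col])]
multiple_choice_cols = [col for col in toy.columns if col not in yes_no_cols]

print(yes_no_cols)
print(multiple_choice_cols)
```

On the real data, the one-hot encoded and unnamed columns would need to be excluded first, as in the loop above.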