# Analyzing Survey Data in Python

What is survey data, and how do we choose a statistical test to analyze it? Answering that means defining the main types of survey data, covering core concepts such as descriptive and inferential statistics, and visualizing the data to decide which statistical modeling technique fits. With those pieces in place, the trends and insights in a survey can be described both qualitatively and quantitatively.

- Types of Survey Data:
1. Ordinal Data: Responses have a meaningful order (e.g., "Never" to "Always").
1. Nominal Data: Categories without a specific order (e.g., gender, city of birth).
1. Interval Data: Ordered with meaningful and equal intervals (e.g., temperature preferences).
1. Ratio Data: Ordered with a true zero point (e.g., mortgage loan amount).
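
The difference between nominal and ordinal data matters in code as well. The sketch below shows one way to tell pandas that a set of labels is ordered; the response labels and variable names are made up for illustration and do not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical Likert-style responses; the labels are illustrative only
responses = pd.Series(["Often", "Never", "Always", "Sometimes", "Often"])

# Declare the categories in their meaningful order so pandas treats them as ordinal
frequency_scale = ["Never", "Sometimes", "Often", "Always"]
ordinal = pd.Categorical(responses, categories=frequency_scale, ordered=True)

# Ordered categoricals support min/max and sort by rank, not alphabetically
print(ordinal.min(), ordinal.max())
print(pd.Series(ordinal).sort_values().tolist())
```

Without `ordered=True` the same labels would behave as nominal data, and `min()` or `max()` would raise an error.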

- Survey Analysis Steps:
1. Define research goals.
1. Determine response rates for representativeness.
1. Learn from feedback to identify key areas for improvement.

- Sampling Techniques:
1. Simple Random Sampling: Randomly selects a subgroup.
1. Stratified Random Sampling: Divides the population into subgroups and samples from each.
1. Weighted Sampling: Weights responses so that the sample matches population demographics.
1. Cluster Sampling: Divides into clusters and samples from them.

## Crosstabulation
Create a cross-tabulation of Gender and Education showing the frequency distribution between both variables.

In [None]:
import pandas as pd

# Create a cross-tabulation of "Gender" and "Education"
# (margins=True adds row and column totals labelled "All")
cross_tabulation = pd.crosstab(we_survey.Gender,
                               we_survey.Education,
                               margins=True)
print(cross_tabulation)

## Descriptive statistics

In [None]:
# Generate descriptive statistics on the non-numerical columns
# (np.object was removed in NumPy 1.24; the built-in `object` works in every version)
print(youth_survey_clean.describe(include=object))

## Inferential statistics

import scipy.stats as st

# Signature: st.norm.interval(confidence, loc, scale)
#   confidence = confidence level (keyword was named `alpha` before SciPy 1.9)
#   loc        = sample mean
#   scale      = standard error of the sample mean
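
To make the three arguments concrete, here is a minimal sketch that builds the interval by hand as mean ± z × standard error and compares it with `norm.interval`. The scores are invented for illustration. The confidence level is passed positionally so the call does not depend on whether the keyword is spelled `alpha` (older SciPy) or `confidence` (newer SciPy).

```python
import numpy as np
from scipy import stats

# Hypothetical satisfaction scores, made up for illustration
scores = np.array([7, 8, 6, 9, 7, 8, 10, 6, 7, 8], dtype=float)

confidence = 0.95
sample_mean = scores.mean()
std_error = stats.sem(scores)  # sample std (ddof=1) divided by sqrt(n)

# Manual interval: mean +/- z * standard error
z = stats.norm.ppf(1 - (1 - confidence) / 2)
manual_interval = (sample_mean - z * std_error, sample_mean + z * std_error)

# Same interval from SciPy; the confidence level is passed positionally
scipy_interval = stats.norm.interval(confidence, loc=sample_mean, scale=std_error)

print(manual_interval)
print(scipy_interval)
```

For small samples like this one, a t-based interval (`stats.t.interval`) is often preferred over the normal approximation.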

## Confidence interval

In [None]:
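import numpy as np
import pandas as pd
import scipy.stats as st

electric_satisfaction = pd.read_csv(
    "austin-energy-customer-satisfaction.csv")

# 99% confidence interval for the mean satisfaction score.
# The confidence level is passed positionally because the keyword was
# renamed from `alpha` to `confidence` in SciPy 1.9.
conf_interval = st.norm.interval(
    0.99,
    loc=np.mean(electric_satisfaction.satisfaction),
    scale=st.sem(electric_satisfaction.satisfaction))

print(conf_interval)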

In [None]:
# Find confidence interval for mean height with 95% confidence level
# (confidence level passed positionally; the `alpha` keyword was renamed in SciPy 1.9)
conf_95 = st.norm.interval(0.95,
                           loc = np.mean(youth_survey_clean['Height_in']),
                           scale = st.sem(youth_survey_clean['Height_in']))

# Find confidence interval for mean height with 99% confidence level
conf_99 = st.norm.interval(0.99,
                           loc = np.mean(youth_survey_clean['Height_in']),
                           scale = st.sem(youth_survey_clean['Height_in']))
print("conf_95 = ", conf_95)
print("conf_99 = ", conf_99)

## Scatter plot inspection

In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot between `Age` and `Height_in`
youth_survey.plot.scatter(
    x='Age',
    y='Height_in')
plt.show()


## Random sample of employees

In [None]:
# Use the `n` parameter to sample exactly 85 employees
employee_satisfaction.sample(n = 85)

# Use the `frac` parameter to sample 17% of employees (85 rows if the table has 500)
employee_satisfaction.sample(frac = 0.17)

## Reproducible random sampling

In [None]:
# Create a reproducible sample of 25 employees
employee_satisfaction.sample(n = 25, random_state = 123)
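
A quick way to see what `random_state` buys you is to draw twice with the same seed and compare. This is a minimal sketch on a made-up table; `employees` and `employee_id` are hypothetical names, not part of the survey data above.

```python
import pandas as pd

# Hypothetical employee table, for illustration only
employees = pd.DataFrame({"employee_id": range(1, 101)})

first_draw = employees.sample(n=5, random_state=123)
second_draw = employees.sample(n=5, random_state=123)

# Identical seeds yield identical rows in the same order
print(first_draw.equals(second_draw))
```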

## Spread of yes's and no's

In [None]:
# Create a pie chart of the responses to the question about working onsite
ABC_survey.onsite_work.value_counts().plot.pie()
plt.show()

## Stratified sampling

In [None]:
# Calculate the percentage of female and male respondents
print(youth_survey.Gender.value_counts(normalize=True) * 100)

# Proportionally sample out 1% of respondents
sample_dataset = youth_survey.groupby(
  "Gender", group_keys = False).apply(
  lambda x: x.sample(frac=0.01))

print(sample_dataset.head())

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Example dataset
data = {'gender': ['male', 'female', 'female', 'male', 'female', 'male', 'female', 'male', 'female', 'male']}
df = pd.DataFrame(data)

# Calculate gender proportions
gender_proportions = df['gender'].value_counts(normalize=True)

# Visualize gender ratio
gender_proportions.plot.pie()
plt.show()

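# Perform stratified sampling: draw 20% from each gender group.
# (With 5 rows per group, frac=0.1 asks for 0.5 rows, which rounds to zero
# and returns an empty frame.)
sampled_df = df.groupby('gender', group_keys=False).apply(lambda x: x.sample(frac=0.2))
print(sampled_df)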

## Blog survey

In [None]:
# Create a crosstab of male and female handedness
survey_crosstab = pd.crosstab(hand_survey['Gender'], 
  hand_survey['Left - right handed'])

# Create a horizontal bar plot of your results
survey_crosstab.plot.barh()
plt.show()

## Analyzing youth spending patterns with Python

In [None]:
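import pandas as pd

# Cross-tabulate Gender against Entertainment
yp_crosstab = pd.crosstab(
    yp_survey['Gender'],
    yp_survey['Entertainment'])
print(yp_crosstab)

# Plot the counts as a horizontal bar chart
yp_crosstab.plot.barh()
plt.show()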

In [None]:
survey = yp_survey.groupby(['Gender', 'Entertainment'])['Age'].count().reset_index()  
survey.columns = ['Gender', 'Entertainment', 'Respondents'] 
survey

In [None]:
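# Share of the sample that falls in each Gender/Entertainment group
survey['% total respondents'] = survey.Respondents * 100. / survey.Respondents.sum()

# Known population share of each group, listed in the same row order as `survey`
survey['% of population'] = [35, 25, 20, 20]

# Weight = population share / sample share
survey['Weight'] = survey['% of population'] / survey['% total respondents']

# Respondent counts after weighting
survey['Weighted Respondents'] = survey.Weight * survey.Respondents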

In [None]:
survey[['Gender','Entertainment', 
        'Respondents','Weighted Respondents']].set_index( 
  ['Gender','Entertainment']).plot.barh() 

In [None]:
hand_survey = hand_survey.groupby(
  ['Gender','Left - right handed'])[
  'Age'].count().reset_index()

hand_survey.columns = ['Gender','Handedness','Respondents']

hand_survey['% total respondents'] = hand_survey.Respondents * 100. / hand_survey.Respondents.sum()

hand_survey['% of population'] = [5.5,44.5,6.5,43.5]

hand_survey['Weight'] = hand_survey['% of population'] / hand_survey['% total respondents']

hand_survey['Weighted Respondents'] = hand_survey.Weight * hand_survey.Respondents

# Plot horizontal bar plot of original and weighted sample of each category
hand_survey[['Gender','Handedness','Respondents','Weighted Respondents']].set_index(['Gender','Handedness']).plot.barh()
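
The weighting arithmetic is easier to follow on a tiny table. The sketch below uses invented counts and population shares (two groups, `A` and `B`) purely for illustration: group `A` is over-represented in the sample, so it is weighted down, and group `B` is weighted up.

```python
import pandas as pd

# Hypothetical respondent counts and population shares (illustration only)
survey = pd.DataFrame({
    "Group": ["A", "B"],
    "Respondents": [80, 20],
    "% of population": [50.0, 50.0],
})

# Sample share of each group, in percent
survey["% total respondents"] = survey["Respondents"] * 100.0 / survey["Respondents"].sum()

# Weight = population share / sample share
survey["Weight"] = survey["% of population"] / survey["% total respondents"]

# Weighted counts keep the same total but follow the population split
survey["Weighted Respondents"] = survey["Weight"] * survey["Respondents"]
print(survey)
```

The weighted counts sum to the original number of respondents because the population shares sum to 100%.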


## Group clusters

In [None]:
# Group population into clusters and calculate respondents
groups = mh_survey.groupby('US_state_live')['sought_treatment'].count().reset_index()
groups.columns = ['US_state_live','count']

# Plot a bar graph of number of respondents by US_state_live
groups.plot.bar(x='US_state_live',y='count')
plt.show()

Choosing clusters

In [None]:
# Create a list of the unique states present in the survey
unique_states = list(set(mh_survey['US_state_live']))

# Randomly choose ten states
random_clusters = np.random.choice(unique_states, size = 10, replace = False)

print(random_clusters)

Cluster sampling analysis

In [None]:
# Subset dataset to include only states in random_clusters
cluster_sample = mh_survey[mh_survey.US_state_live.isin(random_clusters)]

# Create a pie chart of the sought_treatment column
treatment_pie = cluster_sample.sought_treatment.value_counts(normalize=True)

treatment_pie.plot.pie()
plt.show()
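
The three cluster-sampling steps (group, choose clusters, keep everyone in the chosen clusters) can be put together in one self-contained sketch. The table below is synthetic, and the seeded generator is an addition so that the chosen clusters are repeatable; neither comes from the mental health survey used above.

```python
import numpy as np
import pandas as pd

# Hypothetical survey: 6 states with 4 respondents each (illustration only)
survey = pd.DataFrame({
    "state": np.repeat(["CA", "NY", "TX", "WA", "IL", "FL"], 4),
    "respondent_id": range(24),
})

# A seeded generator makes the choice of clusters repeatable
rng = np.random.default_rng(42)
unique_states = sorted(survey["state"].unique())
chosen_states = rng.choice(unique_states, size=2, replace=False)

# Keep every respondent from the chosen clusters
cluster_sample = survey[survey["state"].isin(chosen_states)]
print(sorted(chosen_states), len(cluster_sample))
```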

## Descriptive & Inferential Statistics

Frequency distribution

In [None]:
# Count the number of responses for each age group
freq_dist = dietary_habits.Age.value_counts().to_frame('Number')

# Create a bar chart of the different age groups
freq_dist.plot(kind='bar')
plt.show()

Measures of variability

In [None]:
# Find the range of meals eaten out or ordered
range_value = dietary_habits.eat_out_per_wk.max() - dietary_habits.eat_out_per_wk.min()
print(range_value)

# Find the standard deviation of meals eaten out or ordered
std_dev_value = dietary_habits.eat_out_per_wk.std()
print(std_dev_value)

Measures of central tendency

In [None]:
# Calculate the mean
mean_value = dietary_habits.eat_out_per_wk.mean()
print('mean_value = ', mean_value)

# Calculate the median
median_value = dietary_habits.eat_out_per_wk.median()
print('median_value = ', median_value)

# Calculate the mode (.mode() returns a Series because several values can tie)
mode_value = dietary_habits.eat_out_per_wk.mode()
print('mode_value = ', mode_value)

Visualize data: histogram

In [None]:
# Histogram of Height column
young_people.Height.plot(kind='hist')
plt.show()

## Find the z-score

In [None]:
from scipy import stats

# z-score of `Height` column
young_people['Height_zscore'] = stats.zscore(young_people.Height)

# Outliers in `Height_zscore`
height_outliers = young_people[
  (young_people.Height_zscore>=3)
  |(young_people.Height_zscore<=-3)
]
print(height_outliers)
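
A small worked case shows what the |z| >= 3 rule picks up. The heights below are invented: nineteen ordinary values and one extreme entry. The filter uses `.abs()`, which is equivalent to the two-sided condition above.

```python
import pandas as pd
from scipy import stats

# Hypothetical heights in cm: 19 typical values plus one extreme entry
heights = pd.DataFrame({"Height": [170.0] * 10 + [172.0] * 9 + [250.0]})

# Standardize: (value - mean) / standard deviation
heights["Height_zscore"] = stats.zscore(heights["Height"])

# Flag rows at least 3 standard deviations from the mean
outliers = heights[heights["Height_zscore"].abs() >= 3]
print(outliers)
```

Note that `stats.zscore` propagates missing values by default, so a column containing NaN would need cleaning first.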

## Analyze variables with .corr()

In [None]:
# Find the correlation between variables present in the survey
cor = birth_survey.birth_age.corr(birth_survey.birth_weight)
print(cor)

## Fitting a linear regression model

In [None]:
import statsmodels.api as sm

# Define variable, x and y
x = salary_survey.experience_years.tolist()
y = salary_survey.current_salary.tolist()

# Add the constant term so the model fits an intercept
x = sm.add_constant(x)

# Perform .OLS() regression and fit
result = sm.OLS(y,x).fit()

# Print the summary table
print(result.summary())

## Visualizing survey data

In [None]:
# Plot the original values using a scatter plot
x = salary_survey.experience_years
y = salary_survey.current_salary
plt.scatter(x, y)

# Get the range of data
max_x = salary_survey.experience_years.max()
min_x = salary_survey.experience_years.min()

# Build the regression line over the full range, endpoint included.
# Slope and intercept are hard-coded; they should match the OLS summary table.
x = np.arange(min_x, max_x + 1, 1)
y = 1590.4569 * x + 58080

# Plot the regression line
plt.plot(x, y, 'r')
plt.show()

Find the slope and y-intercept values for the regression line between total visitors and year using the OLS regression method.

In [None]:
# Define variable, x and y
x = foot_traffic.year.tolist()
y = foot_traffic.visitors.tolist()

# Add the constant term
x = sm.add_constant(x)

# Perform .OLS() regression and fit
result = sm.OLS(y,x).fit()

# Print the summary table
print(result.summary())


1. Plot the original values using a scatter plot.
1. Get the range of values for plotting the regression line.
1. Plot the regression line.

In [None]:
# Plot the original values using a scatter plot
x = foot_traffic.year
y = foot_traffic.visitors
plt.scatter(x,y)

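# Get the range of data
max_x = foot_traffic.year.max()
min_x = foot_traffic.year.min()

# Build the regression line over the full range, endpoint included.
# Slope and intercept are hard-coded; they should match the OLS summary table.
x = np.arange(min_x, max_x + 1, 1)
y = 5.9786 * x - 11840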

# Plot the regression line
plt.plot(x, y,'r')
plt.show()
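
The slope and intercept in the plotting cells are typed in by hand. One way to avoid copying numbers is to compute them in code. The sketch below uses `np.polyfit` as a stand-in least-squares fit (it is not the statsmodels call used above) on invented data that lie exactly on a line. With a fitted statsmodels result, the same coefficients are available as `result.params`.

```python
import numpy as np

# Hypothetical yearly visitor counts lying exactly on a line (illustration only)
years = np.array([2015, 2016, 2017, 2018, 2019], dtype=float)
visitors = 6.0 * years - 11900.0

# Degree-1 least-squares fit returns [slope, intercept]
slope, intercept = np.polyfit(years, visitors, 1)

# Evaluate the fitted line over the full observed range, endpoints included
line_x = np.arange(years.min(), years.max() + 1, 1)
line_y = slope * line_x + intercept
print(slope, intercept)
```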

## Assumptions for a two-sample t-test
- Independent observations
- Normal distribution
  - Shapiro-Wilk test, stats.shapiro(): a p-value > 0.05 gives no evidence against normality
- Equal variances
  - Levene test, stats.levene(): a p-value > 0.05 gives no evidence against equal variances

In [None]:
from scipy.stats import shapiro 
import scipy.stats as stats 

In [None]:
# Calculate mean extroversion of males 
male_mean = male_survey.extroversion.mean()

# Calculate mean extroversion of females 
female_mean = female_survey.extroversion.mean()

# Test normality for males
male_norm = stats.shapiro(male_survey.extroversion)

# Test normality for females
female_norm = stats.shapiro(female_survey.extroversion)

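# Test of equal variance
variance_test = stats.levene(male_survey.extroversion, female_survey.extroversion)

# Report every result, not only the variance test
print(male_mean, female_mean)
print(male_norm)
print(female_norm)
print(variance_test)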
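
Once the assumptions have been checked, the test itself is a single call. The sketch below runs the checks and then `stats.ttest_ind` on synthetic scores drawn from a seeded generator; the group names, sizes, and means are assumptions made for illustration, not the extroversion data above. If the Levene test suggests unequal variances, `equal_var=False` switches to Welch's t-test.

```python
import numpy as np
from scipy import stats

# Hypothetical scores drawn from normal distributions (illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=3.0, scale=1.0, size=100)
group_b = rng.normal(loc=3.5, scale=1.0, size=100)

# Check the assumptions first
normal_a = bool(stats.shapiro(group_a).pvalue > 0.05)
normal_b = bool(stats.shapiro(group_b).pvalue > 0.05)
equal_var = bool(stats.levene(group_a, group_b).pvalue > 0.05)

# Use Welch's t-test when the variances look unequal
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(normal_a, normal_b, equal_var, p_value)
```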

## Steps of chi-square analysis in Python

In [None]:
# Perform cross tabulation and assign to variable, cross_table
cross_table = pd.crosstab(tech_mh.anonymity_protected, tech_mh.discuss_with_supervisor)

# Conduct the chi-square analysis
chi_analysis = st.chi2_contingency(cross_table)

print(chi_analysis)

In [None]:
# Perform cross tabulation and assign to variable, cross_table
cross_table = pd.crosstab(tech_mh.work_remotely, tech_mh.current_condition)

# Conduct chi-square analysis
chi_analysis = st.chi2_contingency(cross_table)

# Record p-value and expected frequencies
p_value = chi_analysis[1]
expected_freq = chi_analysis[3]

print(p_value)
print(expected_freq)
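
To read the output of `chi2_contingency`, it helps to run it on a table small enough to check by hand. The counts and labels below are invented for illustration. The result unpacks into the test statistic, the p-value, the degrees of freedom, and the expected frequencies under independence.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical 2x2 table of counts (illustration only)
observed = pd.DataFrame(
    [[30, 10], [10, 30]],
    index=["remote", "onsite"],
    columns=["condition_yes", "condition_no"],
)

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Under independence each expected count is row_total * col_total / grand_total
print(dof)
print(expected)
print("reject independence at the 5% level:", p_value < 0.05)
```

Here every row and column total is 40 out of 80, so each expected count is 40 × 40 / 80 = 20.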