# Motivation to Take Language Courses
---
We are more connected to each other than ever before through social media, articles, and the news, and as a result, exposure to different languages has become ever more common. Take K-pop: in recent years it has exploded in popularity in American culture even though most of the media is in Korean rather than English. With over 58% of the student population at UCSD identifying with ethnicities whose ethnic language is not English, our interest lies in the motivators that lead students in our student body to learn additional languages. Specifically, our question is: what motivators drive language acquisition for students who are taking or have taken language courses at UCSD? Do differences in these motivators affect acquisition and post-course language retention? Some of the factors we consider include familial intentions, cultural appreciation, a desire to better explore, or even "I want to get through my GEs".


## Loading in our Data
---
Our survey was modeled after the Attitude/Motivation Test Battery, first designed by Robert Gardner to study people's motivation to learn a second language.

We adapted the survey to UCSD students who were currently taking language courses, as those students aligned best with our research topic.

Additional inspiration for our survey came from the premise of **Instrumental** and **Integrative** motivators.

In [2]:
# Import packages for data analysis

# Data organization packages
import pandas as pd
import numpy  as np
import json
import sys 

# Plotting packages
import matplotlib.pyplot as plt
import seaborn           as sns
custom_style = {"axes.spines.right": False, "axes.spines.top": False,}
sns.set_style("ticks", rc=custom_style)

# Statistical packages
from scipy.cluster           import hierarchy
from sklearn                 import preprocessing
from sklearn.cluster         import KMeans
from sklearn.metrics         import mean_squared_error as mse
from sklearn.decomposition   import PCA
from statsmodels.miscmodels.ordinal_model import OrderedModel

import pingouin              as pg

# Other packages
from IPython.display import clear_output

In [3]:
# Define all the prior column names and the new column names
prior_cols = [
    'Timestamp', 
    'Score', 
    'What is your age in years?',
    'What gender do you identify as?', 
    'What race(s) do you identify with?',
    'If you selected "Other", please specify below. If not, please leave this question blank.',
    'Are you an international or domestic student?',
    'What is your primary language?',
    'How many languages are you fluent in? (Include your primary language to the count)',
    'Which language class are you currently taking?',
    'How many years have you been learning/utilizing your learned language?',
    'Do you speak this learned language at home?',
    'Is this learned language spoken in your home?',
    'What learning strategies do you utilize to learn the language? [Attending Class]',
    'What learning strategies do you utilize to learn the language? [Participating in Class]',
    'What learning strategies do you utilize to learn the language? [Apps or Software]',
    'What learning strategies do you utilize to learn the language? [Practicing with other Individuals]',
    'What learning strategies do you utilize to learn the language? [Listening to Audio Material]',
    'What learning strategies do you utilize to learn the language? [Reading Books or Articles]',
    'What learning strategies do you utilize to learn the language? [Watching movies or TV shows]',
    'In the last 30 days, how many times have you held a short conversation (~1-5 min) with someone outside of class in your learned language?',
    'In the last 30 days, how many times have you read something outside of class responsibilities in your learned language?',
    'In the last 30 days, how many times have you watched/listened to something outside of class responsibilities in your learned language?',
    'In the last 30 days, how many times have you gone out of your way to interact with others using your learned language?',
    'In the last 30 days, how many times have you come across an opportunity to use your learned language in your daily life?',
    'Do you use Duolingo or any other application for your learned language?',
    'In the last 30 days, how many times have you used this application?',
    'How much do you agree with the statement, "I have not learned enough to use my learned language."',
    'How much do you agree with the statement, "I have learned enough to hold small conversations with native speakers."',
    'How much do you agree with the statement, "I have learned enough to hold adequate conversations with native speakers."',
    'How much do you agree with the statement, "I have learned enough to be considered fluent."',
    'How much would you agree with the statement, "I would like to continue learning this language."',
    'How much would you agree with the statement, "I would like to continue learning my learned language within a structured environment."',
    'How much do you agree with the statement, "I want to become certified fluent within my learned language."',
    'How comfortable are you speaking your learned language?',
    'How comfortable are you reading in your learned language?',
    'How comfortable are you writing in your learned language?',
    'How comfortable are you understanding your learned language when spoken by others?',
    'What grade do you currently have in the class?',
    'What is your expected grade in this class?',
    'Why are you learning your learned language? (Please rank in order from strongest to weakest motivator) [Family]',
    'Why are you learning your learned language? (Please rank in order from strongest to weakest motivator) [Media Consumption]',
    'Why are you learning your learned language? (Please rank in order from strongest to weakest motivator) [General Education]',
    'Why are you learning your learned language? (Please rank in order from strongest to weakest motivator) [Cultural Appreciation]',
    'Why are you learning your learned language? (Please rank in order from strongest to weakest motivator) [Self-Improvement]',
    'Why are you learning your learned language? (Please rank in order from strongest to weakest motivator) [Global Citizenship]',
    'Why are you learning your learned language? (Please rank in order from strongest to weakest motivator) [Other]',
    'If you placed "Other" higher than 7th place for the previous question, please specify here:',
    'I wish I could speak many foreign languages perfectly.',
    'Learning my learned language is really great.',
    'I would get nervous if I had to speak my learned language to a stranger.',
    'I make a point of trying to understand all of the learned language that I see and hear.',
    'Studying my learned language is important because I will need it for my career.',
    'I wish I could read newspapers and magazines in many foreign languages.',
    'Studying my learned language is important because it will allow me to meet and converse with more and varied people.',
    'I have a strong desire to know all aspects of my learned language.',
    'Studying my learned language is important because it will make me more educated.',
    'I wish I could have many friends who natively speak my learned language.',
    'I would really like to learn many foreign languages.',
    'My family have stressed the importance my learned language will have for me when I leave school.',
    'Studying my learned language is important because it will enable me to better understand and appreciate their cultural way of life.',
    'My family feel that I should continue studying my learned language all through school.',
    'My family feel that it is very important for me to learn my learned language.',
    'I want to learn my learned language so well that it will become natural to me.',
    'Studying my learned language is important because it will be useful in getting a good job.',
    'Studying my learned language is important because I will be able to interact more easily with other speakers.',
    'I wish I were fluent in my learned language.',
    'I would rather see a TV program in its own language than dubbed into my primary language.',
    'I often find instances where I wish I could read my learned language.',
    'I feel like I am required to take my language class.',
    'If given the opportunity to choose, I would not learn my learned language again.',
    'If given the opportunity to choose, I would learn my learned language again.',
    'I find my learned language difficult to learn.',
    'My motivation to learn my learned language in order to communicate with other speakers is:',
    'My motivation to learn my learned language for practical purposes (e.g., to get a good job) is:',
    'My family encourage me to learn my learned language:',
    'Extra Credit sign in: Please select the class you are currently taking if eligible for extra credit',
    'Extra Credit Sign In: Please type in your student ID if you are extra credit eligible. This information is solely for your instructor to identify who has completed the survey for extra credit.'
]

new_cols = [
    'Timestamp',
    'Score',
    'demo_age',
    'demo_gender',
    'demo_race',
    'demo_race_other',
    'demo_domestic',
    'demo_primary_lang',
    'demo_num_lang',
    'demo_class',
    'demo_years_learning',
    'demo_home_speaker',
    'demo_home_spoken',
    'engage_attend_class',
    'engage_participate_class',
    'engage_apps',
    'engage_practice_others',
    'engage_listen',
    'engage_read',
    'engage_watch',
    'use_short_conv',
    'use_read',
    'use_watch',
    'use_interact',
    'use_opportunity_use',
    'use_duolingo',
    'use_duolingo_usage',
    'feel_not_learned_enough',
    'feel_small_conversations',
    'feel_adequate_conversations',
    'feel_considered_fluent',
    'feel_continue_learning',
    'feel_continue_structured',
    'feel_certified_fluent',
    'feel_comfortable_speaking',
    'feel_comfortable_reading',
    'feel_comfortable_writing',
    'feel_comfortable_listening',
    'feel_current_grade',
    'feel_expected_grade',
    'rank_family',
    'rank_media',
    'rank_education',
    'rank_culture',
    'rank_improvement',
    'rank_citizenship',
    'rank_other',
    'rank_other_text',
    'motivator_speak',
    'motivator_great',
    'motivator_nervous',
    'motivator_understand',
    'motivator_career',
    'motivator_newspapers',
    'motivator_converse',
    'motivator_aspects',
    'motivator_educated',
    'motivator_friends',
    'motivator_many',
    'motivator_family_stressed',
    'motivator_understand_cultural',
    'motivator_family_continue',
    'motivator_family_important',
    'motivator_natural',
    'motivator_job',
    'motivator_interact',
    'motivator_fluent',
    'motivator_tv',
    'motivator_read',
    'motivator_required',
    'motivator_choose_not',
    'motivator_choose',
    'motivator_difficult',
    'motivator_communicate',
    'motivator_practical',
    'motivator_family_encourage',
    'ec_class',
    'ec_id',
]

In [4]:
# Merge the two lists as a dictionary where the keys are the old cols and values are the new cols
col_dict = dict(zip(prior_cols, new_cols))

# Add a confirmation to prevent overwriting the original file
print('This will overwrite the original file. Are you sure you want to continue?')
confirmation = input('Type "yes" to continue: ')

if confirmation == 'yes':
    # Export the dictionary to a json file
    with open('col_dict.json', 'w') as fp:
        json.dump(col_dict, fp)
    print('File saved')
else:
    print('Exiting without saving.')

This will overwrite the original file. Are you sure you want to continue?
Exiting without saving.


In [5]:
# Create a different dictionary which flips the values as a "decoding"
decode_dict = dict(zip(new_cols, prior_cols))

# Add a confirmation to prevent overwriting the original file
print('This will overwrite the original file. Are you sure you want to continue?')
confirmation = input('Type "yes" to continue: ')

if confirmation == 'yes':
    # Export the dictionary to a json file
    with open('decode_dict.json', 'w') as fp:
        json.dump(decode_dict, fp)
    print('File saved')
else:
    print('Exiting without saving.')

This will overwrite the original file. Are you sure you want to continue?
Exiting without saving.
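
The confirmation prompt above guards against overwriting the saved dictionaries. As a rough alternative sketch, the same guard can be expressed without user input by checking whether the file already exists. The helper name `save_json_if_absent` is hypothetical and not part of the original notebook, and the example writes to a temporary directory rather than the project folder.

```python
import json
import os
import tempfile


def save_json_if_absent(data: dict, path: str) -> bool:
    """Write data to path only when no file exists there yet; return True if written."""
    if os.path.exists(path):
        return False
    with open(path, "w") as fp:
        json.dump(data, fp)
    return True


# Demonstrate on a throwaway directory so no real file is touched
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "col_dict.json")
    first = save_json_if_absent({"Timestamp": "Timestamp"}, target)
    second = save_json_if_absent({"Timestamp": "changed"}, target)
    with open(target) as fp:
        stored = json.load(fp)

print(first, second, stored)
```

The second call leaves the first file in place, which is the behavior the prompt is meant to protect.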


In [6]:
# Load in data from the csv
df = pd.read_csv('survey_responses.csv')

# Remove leading and trailing whitespace from all columns
df = df.rename( 
    columns=lambda x: x.strip()
)

# Rename the columns
df = df.rename(columns=col_dict)
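
Because `rename` silently ignores keys that do not match, it can help to check that every column actually received a short name. The sketch below uses a toy header and a toy mapping (both hypothetical stand-ins for the survey export and `col_dict`) to show one way of listing columns the mapping missed.

```python
import pandas as pd

# Toy stand-ins for the survey header and the rename mapping (hypothetical values)
toy_df = pd.DataFrame(
    columns=[" What is your age in years? ", "Timestamp", "Unexpected question"]
)
toy_col_dict = {"What is your age in years?": "demo_age", "Timestamp": "Timestamp"}

# Strip whitespace first, as in the cell above, then rename
toy_df = toy_df.rename(columns=lambda x: x.strip()).rename(columns=toy_col_dict)

# Any column whose name is not one of the mapping's target names was not renamed
expected_names = set(toy_col_dict.values())
unmapped = [col for col in toy_df.columns if col not in expected_names]
print(unmapped)
```

An empty list would mean every column was matched by the mapping.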

### Getting our Respondents Credit
---
This next section isn't actually relevant to our analysis -- this is just to give our participants credit for completing our survey :D

In [None]:
# Create a new dataframe of the first column and last two columns
df_ec = df[['Timestamp', 'ec_class', 'ec_id']]

# Drop rows where either of the extra credit columns is empty (NaN)
df_ec = df_ec.dropna()

In [None]:
# Get the short class names
ec_classes = ['LISP_1A', 'LISP_1D', 'LISP_1C', 'JAPN_10C', 
              'LTKO_2C', 'JAPN_20C', 'CHIN_10CD', 'CHIN_20CD',
              'CHIN_100CN', 'LISP17', 'LISP18']

# Get the class names within the dataframe
actual_classes = df_ec['ec_class'].unique()

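# Replace the long class names with the short class names
# NOTE: names are paired by position, so this assumes unique() returns
# the classes in the same order as ec_classes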
df_ec['ec_class'] = df_ec['ec_class'].replace(
    to_replace=actual_classes,
    value=ec_classes
)

# Split the df_ec into various different dataframes based on the class
df_ecs = [df_ec[df_ec['ec_class'] == ec_class] for ec_class in ec_classes]
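
The replacement above pairs long and short names by position. A sketch of an order-independent variant uses an explicit dictionary instead. The long labels below are hypothetical placeholders, since the real strings come from the survey export.

```python
import pandas as pd

# Hypothetical long-form labels; the real ones come from the survey export
toy_ec = pd.DataFrame({
    "ec_class": [
        "JAPN 10C (hypothetical label)",
        "LISP 1A (hypothetical label)",
        "JAPN 10C (hypothetical label)",
    ],
    "ec_id": ["A1", "A2", "A3"],
})

# Explicit long -> short mapping, independent of row order
ec_mapping = {
    "LISP 1A (hypothetical label)": "LISP_1A",
    "JAPN 10C (hypothetical label)": "JAPN_10C",
}
toy_ec["ec_class"] = toy_ec["ec_class"].map(ec_mapping)

# map() yields NaN for labels missing from the mapping, so count them
missing = int(toy_ec["ec_class"].isna().sum())

# One dataframe per class
toy_groups = {name: group for name, group in toy_ec.groupby("ec_class")}
print(missing, sorted(toy_groups))
```

A nonzero `missing` count would flag a class label that the mapping does not cover.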

In [None]:
# Export the dataframes to csv files (one per class)
for ec_class, class_df in zip(ec_classes, df_ecs):
    class_df.to_csv(f'ec_{ec_class}.csv', index=False)

## Getting on to our real data cleaning
---
Now that we have given our participants their credit, let's clean the data.

1. Validate that responses are consistent. Points of concern are respondents who selected "Other" for specific questions or who listed 0 as the number of languages they are fluent in.
2. Standardize responses within text fields. Some of our text fields, such as classes, are somewhat verbose, so trimming them down makes them more readable.
3. Remove unnecessary columns.

### Validating Race Replies
---
Here we gather all the responses where people listed their race as "Other" or as not on the list, then add the races they wrote in back into their list of identified races.

In [None]:
# Get the write-in races from respondents who filled in the "other" field
other_races = df.loc[df['demo_race_other'].notnull(), 'demo_race_other'].tolist()

# Get the multi-select race responses from those same respondents
responded_races = df.loc[df['demo_race_other'].notnull(), 'demo_race'].tolist()

# Clean each response and merge the write-in race into it
for idx in range(len(responded_races)):
    
    # replace 'and' with ',' within the response
    if 'and' in responded_races[idx]:
        responded_races[idx] = responded_races[idx].replace(' and', ',')

    # Split the response into a list
    responded_races[idx] = responded_races[idx].split(', ')
    
    # Get rid of the white space within each response
    responded_races[idx] = [x.strip() for x in responded_races[idx]]
    
    # Get rid of the extra response
    if 'Race or Ethnicity not listed here' in responded_races[idx]: 
        responded_races[idx].remove('Race or Ethnicity not listed here')
    
    # Add the other race in if it was not already present
    if 'and' in other_races[idx]:
        other_races[idx] = other_races[idx].split(' and ')
        for race in other_races[idx]:
            if race not in responded_races[idx]: responded_races[idx].append(race)
    
    elif other_races[idx] not in responded_races[idx]:
        responded_races[idx].append(other_races[idx])
    
    responded_races[idx] = ', '.join(responded_races[idx])

print(responded_races)

In [None]:
# Apply the new responses to the dataframe
df.loc[df['demo_race_other'].notnull(), 'demo_race'] = responded_races

# Drop the demo race other column
df = df.drop(columns='demo_race_other')
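
The merging logic above can also be sketched as a small function that is easy to try on sample strings. This is a simplified variant and not the exact code used above: the function name `merge_race_response` and the sample answers are hypothetical, and it assumes answers are separated by commas or the word "and".

```python
def merge_race_response(selected: str, other: str) -> str:
    """Merge a write-in race into a multi-select answer, dropping the placeholder."""
    placeholder = "Race or Ethnicity not listed here"

    # Split the multi-select answer on commas and on the word "and"
    parts = [p.strip() for p in selected.replace(" and ", ", ").split(",")]
    parts = [p for p in parts if p and p != placeholder]

    # Add each write-in race that is not already present
    for race in other.split(" and "):
        race = race.strip()
        if race and race not in parts:
            parts.append(race)

    return ", ".join(parts)


merged_one = merge_race_response("Asian, Race or Ethnicity not listed here", "Middle Eastern")
merged_two = merge_race_response("Race or Ethnicity not listed here", "Persian and Armenian")
print(merged_one)
print(merged_two)
```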

### Validating motivation ranks
---
Here we convert the responses in these columns to numbers and manually validate respondents who didn't rank their motivations properly (e.g., they chose "Other" when one of the listed categories fit well).

In [None]:
# Convert all of the rank columns to numeric
# exclude the rank_other_text column
rank_cols = df.columns[df.columns.str.contains('rank')].tolist()
rank_cols.remove('rank_other_text')

# Convert the columns to numeric as they are currently 1st, 2nd, 3rd, etc.
for col in rank_cols:
    df[col] = df[col].apply(lambda x: int(x[0]))
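
Taking the first character works here because there are only seven ranks. As a sketch of a slightly more defensive variant, the helper below reads all leading digits, assuming the labels look like "1st", "2nd", and so on. The name `ordinal_to_int` is hypothetical.

```python
import re

import pandas as pd


def ordinal_to_int(label: str) -> int:
    """Return the leading number of an ordinal label such as '3rd'."""
    match = re.match(r"\s*(\d+)", label)
    if match is None:
        raise ValueError(f"No leading number in {label!r}")
    return int(match.group(1))


# Toy labels, including a two-digit rank that x[0] would read as 1
toy_ranks = pd.Series(["1st", "2nd", "3rd", "7th", "10th"])
parsed = toy_ranks.map(ordinal_to_int).tolist()
print(parsed)
```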


In [None]:
# Get a dataframe for each of the columns which have text responses which need to be cleaned
rank_other_df = df[['rank_other', 'rank_other_text']]

In [None]:
need_to_fix = rank_other_df.loc[df['rank_other'] != 7]

# Separate the need to fix dataframe into two dataframes one with NaN and one without
need_to_fix_nan = need_to_fix[need_to_fix['rank_other_text'].isnull()]
need_to_fix = need_to_fix[need_to_fix['rank_other_text'].notnull()]

In [None]:
# Get the index of the rows which need to be fixed
need_to_fix_idx = need_to_fix.index.tolist()

# Get the text responses which need to be fixed
need_to_fix_text = need_to_fix['rank_other_text'].tolist()

# Loop through the text responses to read through them
for idx, text in zip(need_to_fix_idx, need_to_fix_text):

    # Get the current values for each of the rank columns
    rank_vals = df.loc[idx, rank_cols].tolist()

    # Create a list of the rank columns ordered by the current values
    rank_cols_ordered = [x for _, x in sorted(zip(rank_vals, rank_cols))]

    clear_output(wait=True)
    print(f'Index: {idx}')
    print(f'Text: {text}')
    print(f'Current value: {df.loc[idx, "rank_other"]}')
    print('Select the motivator which best fits the response:')
    for i in range(len(rank_cols_ordered)):
        print(f'[{i + 1}]: {rank_cols_ordered[i]}')

    motivation = input('Enter the number of the motivator: ')

    # If the user enters a blank, then skip the row
    if motivation == '':
        sys.stdout.flush()
        sys.stdin.flush()
        continue

    # Turn the motivation into an integer
    motivation = int(motivation) - 1

    # If the selected motivator is already ranked higher than "other", don't swap;
    # otherwise swap it into the position "other" currently holds
    other_pos = rank_cols_ordered.index('rank_other')
    if motivation > other_pos:
        rank_cols_ordered[other_pos], rank_cols_ordered[motivation] = (
            rank_cols_ordered[motivation], rank_cols_ordered[other_pos]
        )

    # Move the rank_other to the end of the list
    rank_cols_ordered.append(rank_cols_ordered.pop(rank_cols_ordered.index('rank_other')))

    # Update the dataframe with the new ranks
    for i in range(len(rank_cols_ordered)):
        df.loc[idx, rank_cols_ordered[i]] = i + 1
    
    sys.stdin.flush()


In [None]:
# Get the index of the rows which need to be fixed which are nan
need_to_fix_nan_idx = need_to_fix_nan.index.tolist()

# Loop through the indices and set the rank_other to 7
for idx in need_to_fix_nan_idx:
    # Get the current values for each of the rank columns
    rank_vals = df.loc[idx, rank_cols].tolist()

    # Create a list of the rank columns ordered by the current values
    rank_cols_ordered = [x for _, x in sorted(zip(rank_vals, rank_cols))]

    # Move the rank_other to the end of the list
    rank_cols_ordered.append(rank_cols_ordered.pop(rank_cols_ordered.index('rank_other')))

    # Update the dataframe with the new ranks
    for i in range(len(rank_cols_ordered)):
        df.loc[idx, rank_cols_ordered[i]] = i + 1
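
The re-ranking rule in the two cells above can be sketched as a pure function and tried on a toy row. This follows one reading of the comments: a motivator is promoted only when it sits below "other", and "other" always ends up last. The function name `rerank` and the toy ranks are hypothetical.

```python
from typing import Dict, Optional


def rerank(
    ranks: Dict[str, int],
    other: str = "rank_other",
    promote: Optional[str] = None,
) -> Dict[str, int]:
    """Move `other` to last place, optionally promoting another motivator first."""
    # Order the column names from strongest (1) to weakest
    ordered = sorted(ranks, key=ranks.get)

    if promote is not None:
        other_pos = ordered.index(other)
        promote_pos = ordered.index(promote)
        # Only swap when the promoted motivator currently sits below "other"
        if promote_pos > other_pos:
            ordered[other_pos], ordered[promote_pos] = (
                ordered[promote_pos],
                ordered[other_pos],
            )

    # "other" always ends up last
    ordered.append(ordered.pop(ordered.index(other)))
    return {name: position + 1 for position, name in enumerate(ordered)}


toy_row = {
    "rank_family": 5,
    "rank_media": 1,
    "rank_education": 3,
    "rank_other": 2,
    "rank_culture": 4,
}
promoted = rerank(toy_row, promote="rank_family")
unpromoted = rerank(toy_row)
print(promoted)
print(unpromoted)
```

In both cases the result is still a gap-free ranking, which is a useful property to check after manual fixes.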

### Dropping Unneeded Columns
---
Now that we have done a bit of intense manual data cleaning, let's drop some unnecessary columns and save our frame to a CSV.

In [None]:
# Drop the rank_other column
df = df.drop(columns='rank_other')

In [None]:
# Drop the rank_other_text column
df = df.drop(columns='rank_other_text')

In [None]:
# Drop the timestamp, score, and extra credit columns
df = df.drop(columns=['Timestamp', 'Score', 'ec_class', 'ec_id'])

In [None]:
# Export the dataframe to a csv
df.to_csv('cleaned_data.csv', index=False)

### Standardizing Values
---
Now that we have dropped some columns and checked consistency through the more manual steps, let's standardize the values of the remaining text columns.

In [None]:
# Read in the cleaned data
df = pd.read_csv('cleaned_data.csv')

In [None]:
# Get all the columns where their data type is an object
object_cols = df.columns[df.dtypes == 'object'].tolist()

# List all of these out
object_cols

In [None]:
# Don't actually need the use duolingo column so we can drop that
df.drop(columns='use_duolingo', inplace=True)

In [None]:
# Replace all of the yes and no with 1 and 0
df = df.replace({'Yes': 1, 'No': 0})

In [None]:
# Within the demo_gender column replace the values with 2, 1, or 0
gender_mapping = {'Female': 0, 'Male': 1, 'Prefer not to Say': 2, 'Non Binary': 2}

df['demo_gender'] = df['demo_gender'].replace(gender_mapping)

In [None]:
# Replace all Domestic and International with 1 and 0
df = df.replace({'Domestic': 1, 'International': 0})

In [None]:
# Get a list of all the unique values of demo_class
demo_class_unique = df['demo_class'].unique().tolist()

In [None]:
# Replace all of the classes with a standard format

class_mapping = {
    'LISP 1A Spanish Conversation' : 'LISP_1A',
    'LISP 1D Spanish Conversation' : 'LISP_1D',
    'LISP 1C Spanish Conversation' : 'LISP_1C',
    'CHIN 10CD First Year Chinese/Dialect III' : 'CHIN_10CD',
    'JAPN 10C First Year Japanese III' : 'JAPN_10C',
    'LTKO 2C Intermediate Korean: Second Year III' : 'LTKO_2C',
    'JAPN 20C Second Year Japanese III' : 'JAPN_20C',
    'CHIN 20CD Second Year Chinese/Dialect III' : 'CHIN_20CD',
    'CHIN 100CN Third Year Chinese/Non Native III' : 'CHIN_100CN',
    'LISP17 Intermediate Spanish for the Social Sciences' : 'LISP_17',
    'LISP18 Intermediate Spanish for the Health Sciences' : 'LISP_18',
}

# Replace the values in the demo_class column
df['demo_class'] = df['demo_class'].replace(class_mapping)

In [None]:
# Replace all of the engagement values with numerical values
engage_mapping = {
    'Very Often': 1,
    'Often': 2,
    'Sometimes': 3,
    'Occasionally': 4,
    'Infrequently': 5,
    'Rarely' : 6,
    'Never' : 7
}

# Replace the values across the dataframe (df.replace applies to every column)
df = df.replace(engage_mapping)
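
Since `df.replace` recodes these labels wherever they appear, a column-scoped variant may be preferable if the labels should only be recoded in the engagement columns. The sketch below assumes the `engage_` prefix identifies those columns. The toy frame and its `free_text` column are hypothetical.

```python
import pandas as pd

engage_mapping = {
    "Very Often": 1,
    "Often": 2,
    "Sometimes": 3,
    "Occasionally": 4,
    "Infrequently": 5,
    "Rarely": 6,
    "Never": 7,
}

toy = pd.DataFrame({
    "engage_apps": ["Often", "Never"],
    "engage_read": ["Rarely", "Very Often"],
    "free_text": ["Never", "Often"],  # hypothetical column that should stay untouched
})

# Recode only the columns that start with the engagement prefix
engage_cols = [col for col in toy.columns if col.startswith("engage_")]
toy[engage_cols] = toy[engage_cols].apply(lambda col: col.map(engage_mapping))
print(toy)
```

Note that `map` turns labels missing from the mapping into NaN, which makes unexpected responses easy to spot.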

In [None]:
# Assign individuals who put 0 as their number of languages to 1
df.loc[df['demo_num_lang'] == 0, 'demo_num_lang'] = 1

In [None]:
# Save the cleaned data to a csv
df.to_csv('cleaned_data.csv', index=False)