# Languages and language learning

As part of an online statistics course on Udacity.com, I decided to work on a project where I could apply the skills I had learned. I chose to conduct a brief survey about people's knowledge of languages. In this document, I will show you how I collected the data, how I processed it, and some of the interesting results I obtained.

## Data collection

The survey was created using the free online tool Google Forms. I shared it with other students taking the course, namely on the course's Slack channel and its Udacity forum. I also posted a link to the survey on the subreddits r/SampleSize and r/languagelearning and shared it with some of my friends. For these reasons, the results will not represent the general population in some respects, as these channels are mainly English-speaking communities and some age groups are not adequately represented.

## Data cleaning

After collecting enough data, I had to ensure that all of it was in the desired format. Google Forms offers the option to download survey data as a CSV file, which is what I worked with. The original dataset contained the column `email`, which was removed from the dataset used for this document. Here is a brief summary of the data.

In [None]:
import pandas as pd

with open("languages_data.csv", 'r') as datafile:
    data = pd.read_csv(datafile)

# Print the list of columns in the DataFrame
print(data.columns.values)

# Print a brief summary of selected columns
cols = ['native_no', 'foreign_no', 'age']
print(data[cols].describe())

The columns in which we're most interested are `native_no` (number of native languages a respondent has listed), `foreign_no` (number of foreign languages) and each of the `nativeX` and `foreignX` columns that contain the names of the languages a given person speaks.

If we run the code above, we can see that we have 240 responses in total (the three columns above contain responses to obligatory questions, so none of their values are missing). Some interesting observations we can make from this data are that everyone has at least one native language and that the respondents speak approximately 2 foreign languages on average.

It also seems that the highest number of native languages a person has is 4. We can look at the column containing the 4th native language to check this.

In [None]:
print(data['native4'].unique())

It looks like the column doesn't contain any values at all. Since having four native languages is relatively uncommon, we can assume that it was a typo and that the highest number of native languages is in fact 3 or fewer. Therefore, we can drop columns `native4` and `native5` so that we don't have to worry about them later. For the time being, we will leave the (most likely wrong) value of 4 as it is, as we will deal with similar irregularities later.

In [None]:
# make sure column 'native5' doesn't contain any values
print(data['native5'].unique())

# drop columns 'native4' and 'native5'
data = data.drop(columns=['native4', 'native5'])

If we check each of the five columns that contain foreign languages, we can see that each contains at least one non-empty value. Therefore, we will keep all the columns intact.

We can look at values in the native and foreign languages columns to get an idea about what languages are the most common. We will see that we get some languages listed more than once. The reasons for this, among others, are inconsistent capitalization, trailing spaces, typos etc. We will have to correct these values before we can perform any analysis.

In [None]:
# create list of names of the columns that contain native and foreign languages
native_cols = ['native1', 'native2', 'native3']
foreign_cols = ['foreign1', 'foreign2', 'foreign3', 'foreign4', 'foreign5']

# print unique values in each column
for col in native_cols + foreign_cols:
    print(col)
    print(data[col].unique())

First, we will deal with capitalization. Pandas makes it relatively easy to perform string operations on a whole Series at once, without explicitly looping over its values, which would be very inefficient. However, we still have to iterate over the columns, as the `.str` string methods are not available on a DataFrame.

In [None]:
for col in native_cols + foreign_cols:
    data[col] = data[col].str.title()
    print(col)
    print(data[col].unique())

We can already notice fewer unique names of languages. E.g., before, the column `native1` contained both `romanian` and `Romanian`. However, now there is a single value of `Romanian`.

Nevertheless, the dataset still contains values that can't be fixed by a general rule. We will need to replace values such as `Ger,Am` with their correct equivalents. For this, we will create a dictionary of misspellings and then iterate over the values to see if they need to be changed.

In [None]:
import numpy as np

language_typos = {'None': np.nan, '0': np.nan,
                  'English ': 'English', 'English (C2/Fluent?)': 'English', 'Ingliĺ\xa0': 'English', 'Englidh': 'English',
                  'English, American': 'English','American English': 'English', 'English (American)': 'English',
                  'British English': 'English', 'English Uk': 'English',
                  'Ger,Am': 'German', 'German ': 'German', 'German (Deutsch)': 'German',
                  "Spanish (I Learnt When I Was 9, So I'M Fluent But Not Technically Native)": 'Spanish',
                  'Spanish ': 'Spanish', 'Spansih': 'Spanish', 'French B1': 'French', 'French ': 'French',
                  'Portuguese ': 'Portuguese', 'Portuguăşs (Portuguese)': 'Portuguese',
                  'Japanese ': 'Japanese', 'Japonese ': 'Japanese', 'Chinese': 'Mandarin', 'Mandarin Chinese': 'Mandarin',
                  'Chinese (Mandarin)': 'Mandarin', 'Mandarin ': 'Mandarin', 'Contonese': 'Cantonese',
                  'Armenian ': 'Armenian', 'Norway': 'Norwegian',
                  'Korean ': 'Korean', 'Polski Kurwa': 'Polish', 'Romania': 'Romanian', 'Russian ': 'Russian',
                  'American Sign Language ': 'American Sign Language'                   
                  }

# check if a given cell value is in the language_typos dictionary, if so, replace it
for col in native_cols + foreign_cols:
    data[col] = data[col].fillna(value=np.nan)        # we want the empty value to be the same in the whole dataset
    for i, row in data[[col]].iterrows():
        cur_val = data.loc[i, col]
        if cur_val in language_typos:
            data.loc[i, col] = language_typos[cur_val]
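As an aside, the same kind of clean-up can probably be done without a row-by-row loop. The following is a minimal sketch on made-up data, not the code used for this project: `toy` and `toy_typos` are hypothetical stand-ins for one language column and a small part of the `language_typos` dictionary. It relies on `Series.str.strip()` to remove surrounding whitespace and on `Series.replace()` to swap exact matches.

```python
import numpy as np
import pandas as pd

# Toy data standing in for one language column (values are made up)
toy = pd.Series(['English ', 'Ger,Am', 'None', 'Spanish', np.nan])

# A small, hypothetical subset of the typo dictionary
toy_typos = {'Ger,Am': 'German', 'None': np.nan}

# Strip surrounding whitespace first, then replace exact matches in one call
cleaned = toy.str.strip().replace(toy_typos)
print(cleaned.tolist())
```

Stripping whitespace first means that entries such as `'English '` no longer need their own dictionary key, so the dictionary only has to list real misspellings.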

Good, so now all the string values are what we want them to be. However, some people's answer to how many native or foreign languages they speak doesn't match the number of languages actually listed. For native languages, we will assume that the number of languages people wrote out is the correct number of their native languages. Therefore, we will correct the `native_no` column to match this number.

In [None]:
native_cols = ['native1', 'native2', 'native3']
# count how many languages a person listed
native_count = data[native_cols].count(axis='columns')

# check if there is a difference between the native_no column and actual number of languages listed
print("ORIGINAL DATA")
print(data.loc[data['native_no'] != native_count, ['native_no'] + native_cols])

# change data in the native_no column to match the actual number of native languages
data.loc[data['native_no'] != native_count, 'native_no'] = native_count

# check again if there is a difference between native_no and the actual count
print("\nCORRECTED DATA")
print(data.loc[data['native_no'] != native_count, ['native_no'] + native_cols])

Now we will do the same with the data for foreign languages. There is a difference, however: the table contains an additional column `foreign_other`, which contains language data for people who speak more than 5 foreign languages. This column can contain any number of languages separated by commas and we need to count them. To do this, we will create an extra column in the table called `foreign_other_count`. Obtaining the actual number of languages in each cell will not be hard thanks to Python's string methods.

In [None]:
# add all unique values in the column to a dictionary where
# KEY is the cell value and VALUE is the number of languages
foreign_other = data['foreign_other'].unique()
foreign_other_dict = {}
for item in foreign_other:
    try:
        foreign_other_dict[item] = len(item.replace(" ", "").split(','))
    except AttributeError:  # empty cells are not strings and have no .replace()
        foreign_other_dict[item] = 0

# now map the foreign_other column values to values in foreign_other_dict
data['foreign_other_count'] = data['foreign_other'].map(foreign_other_dict)
print(data['foreign_other_count'].unique())
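The counting step can also be sketched with Pandas string methods only. This is an illustrative alternative on made-up values, assuming that languages are always separated by commas; `toy_other` and `toy_counts` are hypothetical names that do not appear in the dataset.

```python
import numpy as np
import pandas as pd

# Made-up cells: comma-separated languages, with one empty cell
toy_other = pd.Series(['Latin, Greek', 'Welsh', np.nan, 'Thai,Lao, Khmer'])

# Split each cell on commas, count the pieces, and treat empty cells as 0
toy_counts = toy_other.str.split(',').str.len().fillna(0).astype(int)
print(toy_counts.tolist())  # [2, 1, 0, 3]
```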

We will have to take the `foreign_other_count` column into account when checking if the number of foreign languages is correct.

The correction to the `foreign_no` column will be done the same way as for the `native_no` column: the value in the `foreign_no` column will be set to the actual number of languages listed. The one exception is the case where a person indicated that they speak more than 5 foreign languages but only listed 5 (that is, the `foreign_other` value is empty). In this case, we want to keep the original value of `foreign_no`.

In [None]:
foreign_cols = ['foreign1', 'foreign2', 'foreign3', 'foreign4', 'foreign5']
# count how many foreign languages a person listed
foreign_count = data[foreign_cols].count(axis='columns') + data['foreign_other_count']

# check if there are any differences
print("ORIGINAL DATA")
print(data.loc[data['foreign_no'] != foreign_count, ['foreign_no'] + foreign_cols])

# set the 'foreign_no' value to actual number of languages listed
# except for the case where 'foreign_no' is greater than 5
data.loc[(data['foreign_no'] != foreign_count) & (data['foreign_no'] <= 5), 'foreign_no'] = foreign_count

# check if there are any differences
print("\nCORRECTED DATA")
print(data.loc[data['foreign_no'] != foreign_count, ['foreign_no'] + foreign_cols])

We see that we have 3 rows left where the number of languages listed doesn't match the value in the `foreign_no` column. However, as listing the remaining languages was optional, we want to keep the original value.

Next, we want to correct the `combined_no` column, which should contain the sum of native and foreign languages for each row.

In [None]:
# just for curiosity, let's look at how many combined_no values are wrong
print("ORIGINAL DATA")
print(pd.DataFrame(data['native_no'] + data['foreign_no'] == data['combined_no']).describe())

# set the 'combined_no' value to the sum of 'native_no' and 'foreign_no'
data['combined_no'] = data['native_no'] + data['foreign_no']
print("\nCORRECTED DATA")
print(pd.DataFrame(data['native_no'] + data['foreign_no'] == data['combined_no']).describe())

Finally, we will want to make some minor changes to the `foreignX_lvl`, `eff_X` and `enj_X` columns. The `foreignX_lvl` (where X is a number 1-5) indicates the level of a person's ability in the language listed in column `foreignX`. Its values can be A1, A2, B1, B2, C1 or C2 (based on CEFR levels) where A1 is beginner and C2 is proficient. We want to convert these to numerical values so that later we can perform some analysis on this data.

The columns `eff_X` and `enj_X` indicate how efficient a person thinks the learning method X is and how much they enjoy it. The methods that can be substituted for X are: `school` (school classes), `course` (courses outside of school), `self` (self-study), `textbook` (learning from textbooks), `online` (online-learning tools such as videos, podcasts, mobile applications), `native` (conversation with native speakers), `media` (films, books, news etc.) and `travel` (traveling to the country where a given language is spoken). The values range from 1 to 5, where 1 is least efficient/enjoyable and 5 is most efficient/enjoyable. There is also a value "Don't know", which we want to replace with NaN so that we can analyse the non-empty values later.

In [None]:
# map language levels to numbers 1-6
level_cols = ['foreign1_lvl', 'foreign2_lvl', 'foreign3_lvl', 'foreign4_lvl', 'foreign5_lvl']
level_to_number = {'A1': 1, 'A2': 2, 'B1': 3, 'B2': 4, 'C1': 5, 'C2': 6}
for col in level_cols:
    data[col] = data[col].map(level_to_number)
# look at a summary of 2 of the columns
print("LEVEL STATS")
for col in level_cols[:2]:
    print(data[col].describe())
    
# map efficiency and enjoyment values to numbers 1-5 or NaN
eff_cols = ['eff_school', 'eff_course', 'eff_self', 'eff_textbook',
            'eff_online', 'eff_native', 'eff_media', 'eff_travel']
enj_cols = ['enj_school', 'enj_course', 'enj_self', 'enj_textbook',
            'enj_online', 'enj_native', 'enj_media', 'enj_travel']
quality_to_number = {"Don't know": np.nan, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5}
for col in eff_cols + enj_cols:
    data[col] = data[col].map(quality_to_number)
# look at data for 'eff_native' and 'enj_course'
print("\nEFFICIENCY/ENJOYMENT STATS")
for col in ['eff_native', 'enj_course']:
    print(data[col].describe())

Good! Now that our data is finally in the form we want it to be, let's look at some statistics!

## Data analysis

### Basic data about languages

First, let's look at the first three columns: `combined_no`, `native_no` and `foreign_no`. Now that we know that the data is correct, we can look at the mean, mode and median of each of them.

In [None]:
cols = ['combined_no', 'native_no', 'foreign_no']
print("MEAN")
print(data[cols].mean())
print("\nMEDIAN")
print(data[cols].median())
print("\nMODE")
print(data[cols].mode())

As you can see, in the `foreign_no` column, two values are present the same number of times. That means that the same number of people speak 1 and 2 foreign languages.
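The tie is visible because `mode()` returns every value that shares the highest count, not just one. A tiny sketch on made-up numbers (`toy_no` is a hypothetical Series, not survey data):

```python
import pandas as pd

# Made-up counts in which 1 and 2 both occur twice
toy_no = pd.Series([1, 1, 2, 2, 3])

# mode() returns all tied values, sorted in ascending order
toy_modes = toy_no.mode()
print(toy_modes.tolist())  # [1, 2]
```

This is worth keeping in mind wherever `mode()[0]` is used in this document: it picks only the smallest of the tied values.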

We can put all the values in a simple table so that we can come back to it later.

|        | total number of languages | number of native languages | number of foreign languages |
|--------|---------------------------|----------------------------|-----------------------------|
| mean   |                  3.350000 |                   1.216667 |                    2.133333 |
| median |                       3.0 |                        1.0 |                         2.0 |
| mode   |                       3.0 |                        1.0 |                    1.0; 2.0 |

However, these three values - mean, median and mode - don't tell us much on their own. It would be better to see how the values are distributed. We'll use Matplotlib, one of the most popular Python plotting libraries, to show these values on a histogram.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

mean = data['combined_no'].mean()
median = data['combined_no'].median()
mode = data['combined_no'].mode()[0]
# use names other than max/min so that Python's built-in functions are not shadowed
max_no = data['combined_no'].max()
min_no = data['combined_no'].min()

# create histogram with one bin per whole number of languages
plt.hist(data['combined_no'], bins=int(max_no-min_no+1), range=(min_no,max_no+1),
         align='left', color='paleturquoise', rwidth=0.9)
plt.title('Total number of languages')
plt.xticks(list(range(1,12)))
plt.axvline(x=mean, color='navy')
plt.axvline(x=median+0.04, color='green')
plt.axvline(x=mode-0.04, color='darkorange')

# add legend
mean_patch = mpatches.Patch(color='navy', label='Mean')
median_patch = mpatches.Patch(color='green', label='Median')
mode_patch = mpatches.Patch(color='darkorange', label='Mode')
plt.legend(handles=[mean_patch, median_patch, mode_patch],
           bbox_to_anchor=(1,1), loc=2)

plt.show()

Likewise, we can create similar visualizations for each of the 3 columns.

<img src="./images/figure1.png" />

Note that the mean, the median and one of the modes are approximately the same for the number of foreign languages people speak.

Let's look at what languages people speak most. As before, we can look at native and foreign languages individually, as well as at all of them combined. The following three charts show the 15 most spoken languages in the dataset.

<img src="./images/figure13.png" />
<img src="./images/figure14.png" />
<img src="./images/figure15.png" />
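The code behind these charts is not shown here, but the counting step could look roughly like the sketch below. It uses a made-up two-column table; `toy` and `toy_top` are hypothetical names, and this is an assumption about one reasonable approach rather than the method actually used for the figures.

```python
import numpy as np
import pandas as pd

# Made-up table with two native-language columns
toy = pd.DataFrame({
    'native1': ['English', 'Slovak', 'English', 'German'],
    'native2': [np.nan, 'Hungarian', 'Spanish', np.nan],
})

# Put all language columns into one long Series, then count each language.
# value_counts() ignores missing values and sorts by frequency.
all_languages = pd.concat([toy[col] for col in ['native1', 'native2']])
toy_top = all_languages.value_counts().head(15)
print(toy_top)
```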

Now, we will look at foreign languages only. For every foreign language people listed, they also indicated their level in this language. The levels are A1, A2, B1, B2, C1 and C2, A1 being beginner and C2 being proficient. Since there is no point in looking at each level column in the table individually (the order in which people listed the languages is arbitrary and therefore unimportant), we will only consider all of them together. The following pie chart shows what percentage of the listed languages is spoken at each level.

<img src="./images/figure16.png" />

We see that the most common language level is A1 and the least common one is C2; however, the differences are not large.

Additionally, we might wonder if the number of languages a person speaks affects their level in each one. Since the values in the level columns are numeric, we can take their average for each person and compare it to the number of foreign languages to see how they correlate.

In [None]:
# compare average level and number of foreign languages
data['level_means'] = data[level_cols].mean(axis=1)
print(data[['foreign_no', 'level_means']].corr())

The correlation coefficient is almost 0, so there is no linear relationship between the number of languages a person speaks and their average level in them. However, if we look at how people's highest level relates to the number of languages they speak, we will get slightly different data.

In [None]:
# compare highest level and number of foreign languages
data['level_max'] = data[level_cols].max(axis=1)
print(data[['foreign_no', 'level_max']].corr())

Here we get a correlation coefficient of approximately 0.45, which indicates a weak to moderate positive correlation.
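For reference, the coefficient that `corr()` reports by default is the Pearson correlation coefficient. The sketch below computes it by hand on made-up numbers and compares it with the Pandas result; the table `toy` and its values are hypothetical and only reuse the column names for readability.

```python
import pandas as pd

# Made-up values, not taken from the survey
toy = pd.DataFrame({'foreign_no': [1, 2, 3, 4], 'level_max': [2, 3, 5, 6]})

# Deviations of each value from its column mean
x = toy['foreign_no'] - toy['foreign_no'].mean()
y = toy['level_max'] - toy['level_max'].mean()

# Pearson r: sum of products of deviations, scaled by both spreads
r_manual = (x * y).sum() / ((x ** 2).sum() * (y ** 2).sum()) ** 0.5
r_pandas = toy['foreign_no'].corr(toy['level_max'])
print(r_manual, r_pandas)
```

Both numbers should agree up to floating-point rounding. A value near 1 means the points lie close to a rising straight line, while a value near 0 means there is no linear trend.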

### Data by demographics

Now that we know how the data looks overall, we can divide it among several groups and look at the differences between those. Let's start with women and men. The column 'gender' also contains other values (e.g. 'Non-binary'), but there are so few of them that it wouldn't make sense to compare them. We'll look at the 'native_no' and 'foreign_no' columns only. First, we will compare mean, median and mode for women and men.

In [None]:
# extract data for women and men separately
women = data.loc[data['gender']=='Female', ['native_no', 'foreign_no']]
men = data.loc[data['gender']=='Male', ['native_no', 'foreign_no']]

print('NATIVE LANGUAGES\t\t\tFOREIGN LANGUAGES')
print('\nWOMEN\t\t\t\t\tWOMEN')
print('count:\t', women['native_no'].count(),'\t\t\t\tcount:\t', women['foreign_no'].count())
print('mean:\t', women['native_no'].mean(), '\t\tmean:\t', women['foreign_no'].mean())
print('median:\t', women['native_no'].median(), '\t\t\t\tmedian:\t', women['foreign_no'].median())
print('mode:\t', women['native_no'].mode()[0], '\t\t\t\tmode:\t', women['foreign_no'].mode()[0])
print('\nMEN\t\t\t\t\tMEN')
print('count:\t', men['native_no'].count(),'\t\t\t\tcount:\t', men['foreign_no'].count())
print('mean:\t', men['native_no'].mean(), '\t\tmean:\t', men['foreign_no'].mean())
print('median:\t', men['native_no'].median(), '\t\t\t\tmedian:\t', men['foreign_no'].median())
print('mode:\t', men['native_no'].mode()[0], '\t\t\t\tmode:\t', men['foreign_no'].mode()[0])

Here's the table for future reference:

|          | Native languages - Women | Native languages - Men | Foreign languages - Women | Foreign languages - Men |
|----------|--------------------------|------------------------|---------------------------|-------------------------|
| Total    |                      121 |                    110 |                       121 |                     110 |
| Mean     |       1.2066115702479339 |     1.2181818181818183 |         2.206611570247934 |       2.090909090909091 |
| Median   |                      1.0 |                    1.0 |                       2.0 |                     2.0 |
| Mode     |                      1.0 |                    1.0 |                       2.0 |                     1.0 | 

As we see, most values are almost equal for men and women, with one exception - the mode of the number of foreign languages. This is an indicator that even though the mean and median are nearly the same, the distribution may vary slightly. The best way to see this is to plot a histogram.

<img src="./images/figure3.png" />
<img src="./images/figure2.png" />

We can see that the distributions of native languages are basically equal. The only difference between the two histograms for foreign languages is the mode. Otherwise, the shape of the distribution is the same.

Next, we can compare the number of languages people speak at different ages. We could hypothesize that the older people get, the more time they have had to learn foreign languages, so they might speak more of them. If that were the case, the correlation coefficient between the `age` and `foreign_no` columns should be clearly positive. Pandas provides the method `corr()` to check the correlation between columns.

In [None]:
print(data[['foreign_no', 'age']].corr())

We see that the data doesn't support our hypothesis. The correlation coefficient is very close to 0, which means that there is almost no linear relationship between age and the number of foreign languages people speak. We can visualize the data from the two columns to confirm this:

<img src="./images/figure4.png" />

Additionally, we can group all responses by age and compare the mean and median of foreign languages spoken for each group to see that there is no obvious trend.

<img src="./images/figure5.png" />

Note: Age groups 0-9 and 60 and more are excluded since there is not enough data.
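
The grouping code is not part of this document. One way to build such age groups is `pd.cut`, as in the following sketch. The data and the ten-year bin edges are assumptions made for illustration, and `toy` and `by_age` are hypothetical names.

```python
import pandas as pd

# Made-up ages and language counts
toy = pd.DataFrame({'age': [15, 19, 23, 27, 34, 38, 45],
                    'foreign_no': [1, 2, 2, 3, 1, 2, 2]})

# Ten-year bins that include the left edge: [10, 20), [20, 30), ...
bins = list(range(10, 61, 10))
toy['age_group'] = pd.cut(toy['age'], bins=bins, right=False)

# Mean and median per age group; observed=True skips empty groups
by_age = toy.groupby('age_group', observed=True)['foreign_no'].agg(['mean', 'median'])
print(by_age)
```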

Finally, we can compare how the number of languages spoken varies by country. First, we'll look at the difference between native English speakers and native speakers of other languages. Since many people claim that native English speakers can't speak other languages, we will test the assumption that the average number of foreign languages will be smaller for this group.

In [None]:
# select rows with native English speakers
english_is_native = (data['native1'] == 'English') | (data['native2'] == 'English') | (data['native3'] == 'English')

print("FOREIGN LANGUAGES")
print("\n\tNATIVE: English\t\t NATIVE: Other")
print("mean:  ", data.loc[english_is_native, 'foreign_no'].mean(), "\t",
      data.loc[~english_is_native, 'foreign_no'].mean())
print("median:", data.loc[english_is_native, 'foreign_no'].median(), "\t\t\t",
      data.loc[~english_is_native, 'foreign_no'].median())
print("mode:  ", data.loc[english_is_native, 'foreign_no'].mode()[0], "\t\t\t",
      data.loc[~english_is_native, 'foreign_no'].mode()[0])

We see that the difference in all three indicators is considerable. Not only is the median one language higher for people whose native language isn't English, but the difference in means is also approximately 0.7. However, it is important to look at the distribution before drawing any definite conclusions.

<img src="./images/figure6.png" />

From the histogram, it is clear that the only real difference is in the number of people who speak 0 or 1 foreign languages. In fact, nobody whose native language isn't English indicated that they speak 0 foreign languages. This is understandable: the survey was conducted in English, so these respondents speak it as a foreign language. We can conclude that native English speakers are more likely to speak 0 or 1 foreign languages than people whose native language is a language other than English.

Next, let's look at differences between specific countries. Since we don't have enough data for each country in the world, we will group the countries by continent. Australia won't be taken into account, since there are only two datapoints.

In [None]:
# map countries to continents
country_to_continent = {'Nigeria': 'Africa', 'Egypt': 'Africa', 'Algeria': 'Africa',
                       'Brazil': 'South America', 'Bolivia': 'South America', 'Chile': 'South America',
                       'Ecuador': 'South America',
                       'Portugal': 'Europe', 'Poland': 'Europe', 'Italy': 'Europe', 'Austria': 'Europe',
                       'Slovakia': 'Europe', 'Estonia': 'Europe', 'Germany': 'Europe', 'Netherlands': 'Europe',
                       'Belgium': 'Europe', 'United Kingdom': 'Europe', 'Spain': 'Europe', 'Ireland': 'Europe',
                       'Russia': 'Europe', 'France': 'Europe', 'Finland': 'Europe', 'Latvia': 'Europe',
                       'Czech Republic': 'Europe', 'Denmark': 'Europe', 'Romania': 'Europe', 'Norway': 'Europe',
                       'Greece': 'Europe', 'Sweden': 'Europe',
                       'Jordan': 'Asia', 'Saudi Arabia': 'Asia', 'India': 'Asia', 'Korea, South': 'Asia',
                       'Hong Kong': 'Asia', 'United Arab Emirates': 'Asia', 'Turkey': 'Asia', 'Quatar': 'Asia',
                       'Israel': 'Asia', 'Singapore': 'Asia', 'Malaysia': 'Asia', 'Syria': 'Asia',
                       'Philippines': 'Asia', 'Vietnam': 'Asia', 'China': 'Asia', 'Japan': 'Asia',
                       'United States': 'North America', 'Canada': 'North America',
                       'Australia': 'Australia'}
continents = data['country'].map(country_to_continent)
# divide the foreign_no column into groups by continents
grouped_continents = data['foreign_no'].groupby(continents)

Here's a table of the data that is interesting to us:

| Continent | Mean | Median | Mode |
|---|---|---|---|
| Africa | 1.333333 | 2.0 | 2.0 |
| Asia | 1.666667 | 2.0 | 1.0 |
| Europe | 2.433628 | 2.0 | 2.0 |
| North America | 1.857143 | 1.0 | 1.0 |
| South America | 1.860465 | 2.0 | 1.0 |
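
A table like this can be produced from a grouped column in a single `agg` call. The sketch below uses made-up data; `first_mode` is a hypothetical helper that keeps only the smallest mode when several values are tied, which is an assumption about how ties should be handled.

```python
import pandas as pd

# Made-up responses for two continents
toy = pd.DataFrame({
    'continent': ['Europe', 'Europe', 'Europe', 'Asia', 'Asia'],
    'foreign_no': [2, 3, 2, 1, 2],
})

def first_mode(values):
    # Series.mode() can return several values; keep the smallest one
    return values.mode().iloc[0]

# One row per continent, one column per statistic
summary = toy.groupby('continent')['foreign_no'].agg(['mean', 'median', first_mode])
print(summary)
```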

According to this table, Europe might seem like an outlier, since a European speaks on average almost 0.6 more foreign languages than a person from South America, the continent with the second-highest mean. To see how Europeans really compare to everyone else in the dataset, we'll look at the so-called z-score, which is essentially the distance of a specific value from the sample's mean, measured in standard deviations.

In [None]:
stdev_sample = data['foreign_no'].std()
difference_europe_mean = grouped_continents.get_group('Europe').mean() - data['foreign_no'].mean()
print("Z-score:")
print(difference_europe_mean/stdev_sample)

The average of Europe is therefore only 0.21 standard deviations above the average of the whole sample, which isn't a large difference. We can plot a histogram to see the distributions of Europe, Asia and North America - the three continents with the most datapoints in the dataset.

<img src="./images/figure7.png" />

What the graph shows us are three almost identically shaped distributions. There is one difference, however - it seems that people from North America are more likely to speak only 2 or fewer languages than people from Asia and Europe.
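
To make the z-score calculation concrete, here is a minimal sketch on made-up numbers using only Python's standard `statistics` module. The lists `sample` and `group` are hypothetical and are not taken from the survey. The second quantity, which divides by the standard error of the group mean instead of the standard deviation, is not used in this document; it is shown only as one common alternative when the value being compared is itself a mean.

```python
import statistics

sample = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]   # made-up values for the whole sample
group = [2, 3, 3, 4]                       # made-up values for one subgroup

sample_mean = statistics.mean(sample)
sample_std = statistics.stdev(sample)      # sample standard deviation (n - 1), like pandas' std()

# Distance of the group mean from the sample mean
diff = statistics.mean(group) - sample_mean

# z-score as computed in this document: distance in standard deviations
z_individual = diff / sample_std

# Alternative (assumption, not used above): distance in standard errors of the group mean
z_group_mean = diff / (sample_std / len(group) ** 0.5)

print(z_individual, z_group_mean)
```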

### Learning languages

Now that we have finished analysing demographics, let's look at how people learn foreign languages. We'll start with which methods for learning are the most popular. The dataset contains the column `methods`, where we can find which methods each respondent uses to learn languages. To find out how many people use each of the methods, it is necessary to manipulate data in this column.

In [None]:
methods = {}
total_learners = 0
for row in data[['methods']].iterrows():
    if type(row[1]['methods']) != float:
        total_learners += 1
        row_methods = row[1]['methods'].split(';')
        for method in row_methods:
            methods[method] = methods.get(method, 0) + 1
print(total_learners)
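
As an aside, the same tally can be sketched without an explicit loop. The example below uses made-up answers in the same semicolon-separated format; `toy_methods`, `method_counts` and `toy_learners` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up answers; NaN marks a respondent who listed no methods
toy_methods = pd.Series(['school;self', 'self;media', np.nan, 'self'])

# Split each answer into a list, put every method on its own row, then count
method_counts = toy_methods.dropna().str.split(';').explode().value_counts()

# Number of respondents who listed at least one method
toy_learners = int(toy_methods.notna().sum())

print(method_counts.to_dict(), toy_learners)
```

Calling `to_dict()` on the result gives the same kind of mapping from method to number of respondents as the loop above.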

We have created a dictionary where keys are the methods that people use and values are the numbers of people that use each method. Now we can visualize how widely used each method is.

<img src="./images/figure8.png" />

According to this graph, the most popular methods of learning are self-study, school classes and media such as films and books. This data shows us how much each method is used; however, it doesn't say anything about how effective and how well liked each of them is. To see this, we'll have to look at columns `eff_X` and `enj_X`, where X can be replaced by one of the methods above. These columns contain values 1-5, where 1 is least and 5 is most effective and enjoyed, respectively. This allows us to take an average for each of them.

In [None]:
eff_cols = ['eff_school', 'eff_course', 'eff_self', 'eff_textbook',
            'eff_online', 'eff_native', 'eff_media', 'eff_travel']
enj_cols = ['enj_school', 'enj_course', 'enj_self', 'enj_textbook',
            'enj_online', 'enj_native', 'enj_media', 'enj_travel']
print("EFFICIENCY SCORES")
for col in eff_cols:
    print(col, data[col].mean())
print("\nENJOYMENT SCORES")
for col in enj_cols:
    print(col, data[col].mean())

We see that the most efficient method by far seems to be talking to a native speaker of the target language. However, this is only the third most enjoyed method, since both media (books, films, music etc.) and travel have a higher average enjoyment score.

An interesting thing that this table shows us is the fact that methods with lower efficiency scores tend to have a lower enjoyment score. To see if there is any correlation between how efficient people think a method is and how much they enjoy it, we can compare all the efficiency columns to their respective enjoyment columns.

In [None]:
eff_cols = ['eff_school', 'eff_course', 'eff_self', 'eff_textbook',
            'eff_online', 'eff_native', 'eff_media', 'eff_travel']
for i, col in enumerate(eff_cols):
    print('\n')
    print(data[[col, enj_cols[i]]].corr())

This table shows us that the correlation between efficiency and enjoyment varies between moderate positive correlation (coefficient of 0.5 or more) and strong positive correlation (coefficient of 0.7 or more). This doesn't necessarily mean that more effective methods tend to be enjoyed more, only that people are more likely to find methods that they enjoy effective. We can plot the relationship between the indicated scores for efficiency and enjoyment for a specific method to see this trend.

<img src="./images/figure9.png" />

The bigger the point on the graph is, the more people listed a specific combination of efficiency and enjoyment. We see that the biggest dots lie approximately on the line x=y.
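
The point sizes in such a plot come from counting how often each pair of scores occurs. A minimal sketch of that counting step on made-up scores follows; `toy` and `combo_counts` are hypothetical names, and the plotting itself is left out.

```python
import pandas as pd

# Made-up efficiency and enjoyment scores for one method
toy = pd.DataFrame({'eff_self': [5, 5, 4, 3, 5, 4],
                    'enj_self': [5, 5, 4, 2, 4, 4]})

# Number of respondents for each (efficiency, enjoyment) combination
combo_counts = toy.groupby(['eff_self', 'enj_self']).size()
print(combo_counts)
```

Each entry of `combo_counts` could then be used as the size of one point, for example by passing it (scaled by a constant) as the `s` argument of `plt.scatter`.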

Now, let's look at two interesting statistics: when people started learning their first foreign language and how many hours per week they devote to studying one. What might be a little surprising is the fact that the mean and median of the age when people first started to study a foreign language are almost the same:

| Mean | Median | Mode |
|---|---|---|
| 10.13 | 10.0 | 11.0 |

We can see that most values fall somewhere between ages 6 and 12, i.e. the first few years of school in most countries.

<img src="./images/figure10.png" />

Just like the data for age, the data that indicates how many hours per week people spend learning a language is concentrated in one place, in this case close to 0.

| Mean | Median | Mode |
|---|---|---|
| 6.72 | 5.0 | 5.0 |

As we see, both the median and the mode are 5.0 hours per week. This is an interesting value, since it can be interpreted as 1 hour every working day. Roughly half of all respondents therefore study 1 hour per working day or less, and the other half study at least that much.

<img src="./images/figure11.png" />

Finally, the last thing we're going to look at is how many people actually learn foreign languages. We can visualize this nicely on a pie chart.

<img src="./images/figure12.png" />

As you can see, a majority of people who took the survey are learning a language at present, and only 6% of people are sure they don't want to learn one in the future.

## Conclusion

The tables, graphs and text above document statistics about differences in the ability to speak languages and about how people learn languages. The sample contained only 240 responses collected through a few online channels, so it isn't representative of the whole population. However, the same analysis should be applicable to a larger sample as well.

The tools used to collect and process the data are Google Forms, Python and the Python modules NumPy, Pandas and Matplotlib. The code used to create the graphs, as well as the complete data, is available on the author's GitHub profile: https://github.com/kszabova/language-statistics.