# Analysis of the Most Commonly Spoken Languages at Home in Niagara Falls, Ontario

# Introduction

The program below produces the following graph, which shows the 20 languages most often spoken at home in Niagara Falls, Ontario. Note that the y-axis is logarithmic: English speakers far outnumber speakers of every other language, so on a linear axis the remaining bars would be barely visible.

![niagara-falls-top-20-languages.png](niagara-falls-top-20-languages.png)
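To get a feel for why a logarithmic axis helps, here is a small sketch using invented counts (they are not taken from the census file). It shows that values differing by a factor of 160 sit only about 2.2 units apart on a base-10 log scale. The name `log_positions` is made up for this illustration.

```python
import math

# Invented counts for illustration only; not real census figures
counts = {"English": 80000, "Italian": 2000, "Tagalog": 500}

# position of each count on a base-10 logarithmic scale
log_positions = {language: math.log10(count) for language, count in counts.items()}

for language, count in counts.items():
    print(f"{language:8s} count: {count:6d}  log10: {log_positions[language]:.2f}")

# ratio between the largest and smallest count on a linear scale
linear_ratio = max(counts.values()) / min(counts.values())

# distance between the same two values on the log scale
log_gap = max(log_positions.values()) - min(log_positions.values())
print(f"linear ratio: {linear_ratio:.0f}x, log gap: {log_gap:.2f}")
```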

## Data Source

+ Niagara Falls (Ontario) open data site: https://open.niagarafalls.ca/ 
+ License terms: https://open.niagarafalls.ca/pages/terms-of-use

## Extra Challenges

+ Adapt the program for your own jurisdiction.
+ Remove the dominant language (English in this case) from the chart to see how it changes.
+ (Any other challenges that might be appropriate? Let me know in "Issues", above.)

## Program Description

This program reads a CSV file of 2021 Census data on the language spoken most often at home in Niagara Falls, Ontario. It selects the columns that hold the total number of speakers for each individual language, skipping aggregate columns such as "Single responses (Total)" and "Official languages (Total)", and builds a new DataFrame from them. It then sums each column to get the total number of speakers per language and keeps the 20 largest. Finally, it draws a bar chart with Plotly Express showing those 20 languages and their totals.
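The select, sum, and rank steps described above can be tried on a tiny table first. The sketch below is a simplified version of the filter used in the program (it omits the length check and the explicit exclusion list), and every column name and count in `toy_df` is invented for illustration rather than taken from the real CSV.

```python
import pandas as pd

# Hypothetical miniature of the census table: column names and counts are invented
toy_df = pd.DataFrame({
    "Area": ["A", "B"],
    "English (Total)": [900, 800],
    "French (Total)": [30, 20],
    "Italian (Total)": [25, 15],
    "Official languages (Total)": [930, 820],  # aggregate, should be skipped
    "English and French (Total)": [5, 3],      # multiple response, should be skipped
    "English (Men+)": [440, 390],              # not a "(Total)" column, should be skipped
})

excluded_words = [" and ", "Other", "Multiple", "languages"]

# keep only "(Total)" columns that name a single language
language_columns = [
    column for column in toy_df.columns
    if "(Total)" in column and not any(word in column for word in excluded_words)
]

# sum each kept column, then strip the " (Total)" suffix from the labels
totals = toy_df[language_columns].sum()
totals.index = totals.index.str.replace(" (Total)", "", regex=False)

# nlargest(2) keeps the two biggest totals, largest first
top_two = totals.nlargest(2)
print(top_two)
```

In this toy table only the English, French, and Italian columns survive the filter, and `top_two` holds English and French.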

# Program

In [None]:
import pandas as pd
import plotly.express as px

# read in the CSV file
data_file = 'https://raw.githubusercontent.com/pbeens/Data-Analysis/main/Open-Data/Niagara-Falls-ON/Niagara-Falls-2021-Census-Language-Spoken-Most-Often-at-Home.csv'
nf_languages_df = pd.read_csv(data_file)

# extract the columns that contain data on the total number of speakers for each language,
# skipping aggregate and multi-language columns
excluded_columns = ['Single responses (Total)', 'Official languages (Total)']
excluded_words = [" and ", "Other", "Multiple", "languages"]
languages = []
for column in nf_languages_df.columns:
    if ("(Total)" in column and len(column) < 100
            and column not in excluded_columns
            and not any(word in column for word in excluded_words)):
        languages.append(column)

# create a new DataFrame with these columns
# building it with a single pd.concat call avoids the "DataFrame is highly fragmented" PerformanceWarning
languages_df = pd.concat([nf_languages_df[language].rename(language.replace(" (Total)", ""))
                          for language in languages], axis=1)

# calculate the total number of speakers for each language and find the 20 most commonly spoken languages
language_totals = languages_df.sum(numeric_only=True)
top_languages = language_totals.nlargest(20)

# create a bar chart using Plotly Express to display the top 20 languages and their total number of speakers
fig = px.bar(top_languages, 
             x=top_languages.index, 
             y=top_languages, 
             log_y=True)
fig.update_layout(xaxis_title='Language', 
                  yaxis_title='Total', 
                  title='Top 20 Languages in Niagara Falls, Ontario')
fig.show()
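For the "remove the dominant language" challenge, one possible approach is to drop the largest entry before charting. The sketch below uses an invented Series in place of `top_languages` (the numbers are not from the census file) so that it runs on its own. The names `toy_totals`, `dominant_language`, and `without_largest` are made up for this illustration.

```python
import pandas as pd

# Invented totals standing in for `top_languages`; not real census figures
toy_totals = pd.Series({"English": 1700, "French": 50, "Italian": 40, "Spanish": 35})

# idxmax() returns the label of the largest value, here the dominant language
dominant_language = toy_totals.idxmax()

# drop() returns a new Series without that label; the original is left unchanged
without_largest = toy_totals.drop(dominant_language)

print(dominant_language)
print(without_largest)
```

The resulting Series could be passed to `px.bar` in the same way as `top_languages`. With the dominant language gone, a linear y-axis (omitting `log_y=True`) may be readable enough.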