## Data ingestion and scraping 

### How to use this notebook

The goal of this notebook is to download the necessary datasets and structurally prepare them for data cleaning and analysis.

Use this notebook after downloading the datasets from their respective URLs. For the Wikipedia data, use the code below to scrape the tables from the web pages.

### Overview

We pull data from three main sources that describe various features of a language, then compile it into one dataset that will be used to train our model.

The datasets and features are as follows: 

1. **Endangered Languages Dataset**

  - *URL:* https://endangeredlanguages.com/userquery/
  - Number of speakers (numeric, discrete)
  - Areas and countries where spoken (categorical)
  - Level of endangerment (categorical)

2. **Wikipedia list of official languages by country and territory and List of Languages by Speaker Count** $^*$

  - *URL:* https://en.wikipedia.org/wiki/List_of_official_languages_by_country_and_territory
  - *URL*: https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers 
  - Recognized as a country's official language by a government body (binary)
  - Widely spoken, regional, minority, or national language (categorical)
  - Number of speakers of non-endangered languages (numeric, discrete)

3. **World Bank Indicators**
  - *URL:* https://data.worldbank.org/indicator?tab=all
  - Rate of urbanization in countries where it is spoken (numeric, continuous)
  - Percentage of population using the internet (numeric, continuous)

$^*$ **NOTE:** Under Wikipedia's Creative Commons Attribution-ShareAlike 4.0 International license, you are free to:
  - Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt: remix, transform, and build upon the material for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

See more at https://creativecommons.org/licenses/by-sa/4.0/deed.en

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os

## Switch to the workspace root only if it exists; otherwise stay in the current directory
if os.path.isdir('/workspace'):
    os.chdir('/workspace')

### Endangered Languages Dataset

In [43]:
## Read csv

df1 = pd.read_csv('Project/endangered_languages.csv', header=None)
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,3645,knw,!Xun,Ju; !Xun (Ekoka); Kung-Ekoka; !Kung; Ekoka-!Xû...,"Vulnerable (20 percent certain, based on the e...","14,000-18,000",Kx'a,"Southeastern !Xun, Northwestern !Xun, Central ...",,,South Africa;Namibia;Angola;,Africa,"-28.74358,23.983154; -17.560247, 18.050537; -1..."
1,3956,bpk,'Ôrôê,Orowe; Boewe; Neukaledonien;,"Endangered (20 percent certain, based on the e...",590,Austronesian; Malayo-Polynesian; Oceanic; New ...,,,,New Caledonia;,Pacific,"-21.4223,165.4678"
2,1933,taa,(Lower) Tanana,,"Critically Endangered (80 percent certain, bas...",25,Athabaskan-Eyak-Tlingit; Dene (Athabaskan),Minto-Nenana; Salcha; Chena,,Tanana is the language of the Lower Tanana riv...,USA;,North America,"65.157778, -149.37;64.521111, -146.980556;64.5..."
3,1043,con,A'ingae,Kofane; Cofán; Kofán; A'i; A'ingaé; Colin; Kof...,"Vulnerable (100 percent certain, based on the ...",1500,Isolate; South American,,,,Colombia;Ecuador;,South America,"0.054639, -77.409417"
4,3581,aas,Aasáx,"Asax; Asá; Aasá; Assa; Asak; ""Ndorobo""; ""Dorob...",Dormant,0,Afro-Asiatic; Cushitic; South Cushitic,,,,Tanzania;,Africa,"-5.1948,37.738"


In [44]:
## Insert column headers

df1.columns = ['index', 'abbrv', 'official_name', 'other_names', 'level', 'speakers', 'root_1', 'root_2', 'root_3', 
                'root_4', 'country', 'continent', 'long_lat']
df1.head()

Unnamed: 0,index,abbrv,official_name,other_names,level,speakers,root_1,root_2,root_3,root_4,country,continent,long_lat
0,3645,knw,!Xun,Ju; !Xun (Ekoka); Kung-Ekoka; !Kung; Ekoka-!Xû...,"Vulnerable (20 percent certain, based on the e...","14,000-18,000",Kx'a,"Southeastern !Xun, Northwestern !Xun, Central ...",,,South Africa;Namibia;Angola;,Africa,"-28.74358,23.983154; -17.560247, 18.050537; -1..."
1,3956,bpk,'Ôrôê,Orowe; Boewe; Neukaledonien;,"Endangered (20 percent certain, based on the e...",590,Austronesian; Malayo-Polynesian; Oceanic; New ...,,,,New Caledonia;,Pacific,"-21.4223,165.4678"
2,1933,taa,(Lower) Tanana,,"Critically Endangered (80 percent certain, bas...",25,Athabaskan-Eyak-Tlingit; Dene (Athabaskan),Minto-Nenana; Salcha; Chena,,Tanana is the language of the Lower Tanana riv...,USA;,North America,"65.157778, -149.37;64.521111, -146.980556;64.5..."
3,1043,con,A'ingae,Kofane; Cofán; Kofán; A'i; A'ingaé; Colin; Kof...,"Vulnerable (100 percent certain, based on the ...",1500,Isolate; South American,,,,Colombia;Ecuador;,South America,"0.054639, -77.409417"
4,3581,aas,Aasáx,"Asax; Asá; Aasá; Assa; Asak; ""Ndorobo""; ""Dorob...",Dormant,0,Afro-Asiatic; Cushitic; South Cushitic,,,,Tanzania;,Africa,"-5.1948,37.738"


In [45]:
## Save updated csv to repo folder

df1.to_csv('Project/datasci207-final-project/Data/endangered_languages.csv')
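
The preview above shows several fields that pack more than one value into a single string: `country` is a semicolon-delimited list with a trailing semicolon, `level` carries a certainty note in parentheses, and `speakers` can be a range such as `14,000-18,000`. The following is a minimal sketch of how these could be unpacked during cleaning. The helper names, the new column names, and the choice to take the midpoint of a speaker range are assumptions for illustration, not part of the original notebook.

```python
import re
import pandas as pd

# Small sample mirroring rows from the preview above.
sample = pd.DataFrame({
    "official_name": ["!Xun", "A'ingae", "Aasáx"],
    "level": [
        "Vulnerable (20 percent certain, based on the e...",
        "Vulnerable (100 percent certain, based on the ...",
        "Dormant",
    ],
    "speakers": ["14,000-18,000", "1500", "0"],
    "country": ["South Africa;Namibia;Angola;", "Colombia;Ecuador;", "Tanzania;"],
})

def split_countries(value):
    """Hypothetical helper: turn 'A;B;' into ['A', 'B']."""
    return [part.strip() for part in str(value).split(";") if part.strip()]

def parse_speakers(value):
    """Hypothetical helper: midpoint of a range, or the single number given."""
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", str(value))]
    return sum(numbers) / len(numbers) if numbers else float("nan")

# Separate the endangerment label from the optional "(N percent certain ...)" note.
level_parts = sample["level"].str.extract(
    r"^(?P<level_name>[^(]+?)\s*(?:\((?P<certainty_pct>\d+) percent certain.*)?$"
)

sample["country_list"] = sample["country"].apply(split_countries)
sample["speakers_est"] = sample["speakers"].apply(parse_speakers)
sample = sample.join(level_parts)

print(sample[["official_name", "level_name", "certainty_pct", "speakers_est", "country_list"]])
```

Rows without a certainty note (such as `Dormant`) get a missing value in `certainty_pct` rather than a guess.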

### Wikipedia Datasets

In [3]:
## Official language: Scrape from Wikipedia page

!pip install lxml

## URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_official_languages_by_country_and_territory"

## Read all tables from the page
tables = pd.read_html(url)

## Inspect how many tables were found
print(f"Number of tables found: {len(tables)}")

## Example: View the second table (index 1), which holds the official-languages table on this page
df2 = tables[1]
print(df2.head())

Number of tables found: 9
         Country/Region  Number of official (including de facto)  \
0           Abkhazia[a]                                        2   
1  Afghanistan[1][2][3]                                        2   
2            Albania[4]                                        1   
3            Algeria[5]                                        2   
4               Andorra                                        1   

    Official language(s)                               Regional language(s)  \
0         Abkhaz Russian                                                NaN   
1  Persian (Dari) Pashto  Uzbek[b] Turkmen[b] Pashayi[b] Nuristani[b] Ba...   
2               Albanian                                                NaN   
3          Arabic Berber                                                NaN   
4             Catalan[6]                                                NaN   

         Minority language(s)   National language(s)   Widely spoken  
0                  
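
Selecting a table by position (`tables[1]`) depends on the page layout at the time of scraping, and Wikipedia pages change. One possible alternative is to choose the table by the column names it is expected to contain. The sketch below uses a hypothetical helper, `pick_table`, and a stand-in list of DataFrames in place of the real `pd.read_html(url)` result so that it runs offline.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Hypothetical helper: return the first table containing all required columns."""
    for table in tables:
        # Column labels can be tuples when the page uses a two-row header,
        # so compare against the string form of each label.
        labels = [str(col) for col in table.columns]
        if all(any(req in label for label in labels) for req in required_columns):
            return table
    raise ValueError(f"No table with columns {required_columns}")

# Stand-in for the list returned by pd.read_html(url).
tables_demo = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({"Country/Region": ["Andorra"], "Official language(s)": ["Catalan"]}),
]

picked = pick_table(tables_demo, ["Country/Region", "Official language(s)"])
print(picked)
```

`pd.read_html` also accepts a `match` argument that keeps only tables containing the given text, which may serve the same purpose at scrape time.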

In [47]:
## Save table to csv in repo folder 

df2.to_csv('Project/datasci207-final-project/Data/official_languages.csv')

In [48]:
## Read new csv

df2 = pd.read_csv('Project/datasci207-final-project/Data/official_languages.csv')
df2.head()

Unnamed: 0.1,Unnamed: 0,Country/Region,Number of official (including de facto),Official language(s),Regional language(s),Minority language(s),National language(s),Widely spoken
0,0,Abkhazia[a],2,Abkhaz Russian,,Georgian,Abkhaz,
1,1,Afghanistan[1],2,Persian (Dari) Pashto,Uzbek[b] Turkmen[b] Pashayi[b] Nuristani[b] Ba...,,Persian (Dari) Pashto,Persian (Dari)
2,2,Albania[2],1,Albanian,,Greek Macedonian Aromanian,,Italian
3,3,Algeria[3],2,Arabic Berber,,,Arabic Berber,French
4,4,Andorra,1,Catalan[4],,Spanish French Portuguese,,
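
Two things stand out in the preview above: the scraped values still carry Wikipedia footnote markers such as `[a]` and `[4]`, and the saved index comes back as an extra `Unnamed: 0` column. The sketch below shows one way to handle both on a small sample. The regular expression and the use of `index=False` are suggestions under the assumption that the row index carries no information worth keeping.

```python
import io
import pandas as pd

# Small sample mirroring values from the preview above.
sample = pd.DataFrame({
    "Country/Region": ["Abkhazia[a]", "Afghanistan[1]", "Andorra"],
    "Official language(s)": ["Abkhaz Russian", "Persian (Dari) Pashto", "Catalan[4]"],
})

# Remove bracketed footnote markers such as [a], [1], or [12].
footnote = r"\[[^\]]*\]"
cleaned = sample.replace(footnote, "", regex=True)
cleaned = cleaned.apply(lambda col: col.str.strip())

# Writing without the index avoids an 'Unnamed: 0' column on re-read.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip)
```

If the index should be kept in the file, reading it back with `pd.read_csv(path, index_col=0)` is another way to avoid the extra column.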


In [3]:
## Number of Speakers: Scrape from Wikipedia page

!pip install lxml

## URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers"

## Read all tables from the page
tables_2 = pd.read_html(url)

## Inspect how many tables were found
print(f"Number of tables found: {len(tables_2)}")

## Example: View the first one (this is usually the main table)
df5 = tables_2[0]
print(df5.head())

Number of tables found: 7
                                            Language         Family  \
                                            Language         Family   
0                   English (excl. creole languages)  Indo-European   
1  Mandarin Chinese (incl. Standard Chinese, but ...   Sino-Tibetan   
2                                 Hindi (excl. Urdu)  Indo-European   
3                   Spanish (excl. creole languages)  Indo-European   
4            Modern Standard Arabic (excl. dialects)   Afro-Asiatic   

       Branch Numbers of speakers (millions)                        \
       Branch           First- language (L1) Second- language (L2)   
0    Germanic                            390                  1138   
1     Sin

In [4]:
## Save table to csv in repo folder 

df5.to_csv('Project/datasci207-final-project/Data/speaker_count.csv')

In [5]:
## Read new csv

df5 = pd.read_csv('Project/datasci207-final-project/Data/speaker_count.csv')
df5.head()

Unnamed: 0.1,Unnamed: 0,Language,Family,Branch,Numbers of speakers (millions),Numbers of speakers (millions).1,Numbers of speakers (millions).2
0,,Language,Family,Branch,First- language (L1),Second- language (L2),Total (L1+L2)
1,0.0,English (excl. creole languages),Indo-European,Germanic,390,1138,1528
2,1.0,"Mandarin Chinese (incl. Standard Chinese, but ...",Sino-Tibetan,Sinitic,990,194,1184
3,2.0,Hindi (excl. Urdu),Indo-European,Indo-Aryan,345,264,609
4,3.0,Spanish (excl. creole languages),Indo-European,Romance,484,74,558
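
In the preview above, the second header level of the scraped table (`First- language (L1)`, `Total (L1+L2)`, and so on) ends up as data row 0 after the CSV round trip, because the table has two-level column labels. A possible remedy is to flatten the labels to single strings before saving. The sketch below rebuilds a one-row version of the table; the `flatten` helper and the `": "` separator are illustrative assumptions.

```python
import io
import pandas as pd

# One-row stand-in for the scraped table, with the same two-level header shape.
columns = pd.MultiIndex.from_tuples([
    ("Language", "Language"),
    ("Family", "Family"),
    ("Numbers of speakers (millions)", "First- language (L1)"),
    ("Numbers of speakers (millions)", "Total (L1+L2)"),
])
demo = pd.DataFrame(
    [["English (excl. creole languages)", "Indo-European", 390, 1528]],
    columns=columns,
)

def flatten(label):
    """Hypothetical helper: join the two header levels unless they repeat."""
    top, bottom = label
    return top if top == bottom else f"{top}: {bottom}"

demo.columns = [flatten(label) for label in demo.columns]

# Round trip through CSV: the first row is now real data, not a header remnant.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip.columns.tolist())
```

An alternative that keeps both header levels is to read the file back with `pd.read_csv(path, header=[0, 1], index_col=0)`.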


### World Bank Indicators: urban development

In [49]:
## Read csv, skipping the extra rows above the header

df3 = pd.read_csv('Project/wb_urban.csv', skiprows=3)
df3.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
0,Aruba,ABW,Urban population (% of total population),SP.URB.TOTL.IN.ZS,50.776,50.761,50.746,50.73,50.715,50.7,...,43.192,43.293,43.411,43.546,43.697,43.866,44.052,44.254,,
1,Africa Eastern and Southern,AFE,Urban population (% of total population),SP.URB.TOTL.IN.ZS,14.576676,14.825175,15.083802,15.363045,15.655383,15.955912,...,34.919544,35.396289,35.893398,36.384272,36.884034,37.393633,37.909012,38.424898,,
2,Afghanistan,AFG,Urban population (% of total population),SP.URB.TOTL.IN.ZS,8.401,8.684,8.976,9.276,9.586,9.904,...,25.02,25.25,25.495,25.754,26.026,26.314,26.616,26.933,,
3,Africa Western and Central,AFW,Urban population (% of total population),SP.URB.TOTL.IN.ZS,14.710006,15.094445,15.487932,15.900682,16.331319,16.779793,...,45.47385,46.094137,46.709753,47.322617,47.931021,48.531971,49.129808,49.711184,,
4,Angola,AGO,Urban population (% of total population),SP.URB.TOTL.IN.ZS,10.435,10.798,11.204,11.624,12.058,12.504,...,64.149,64.839,65.514,66.177,66.825,67.46,68.081,68.688,,
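
The World Bank file arrives in wide format, with one column per year and an empty trailing column (`Unnamed: 69` above). For later joins it may be easier to work with one row per country and year. The sketch below shows one way to drop the auto-named column and reshape with `melt`, using a few values from the preview; the output column names `year` and `value` are assumptions.

```python
import pandas as pd

nan = float("nan")

# Two rows and three year columns taken from the preview above.
wide = pd.DataFrame({
    "Country Name": ["Aruba", "Angola"],
    "Country Code": ["ABW", "AGO"],
    "Indicator Name": ["Urban population (% of total population)"] * 2,
    "Indicator Code": ["SP.URB.TOTL.IN.ZS"] * 2,
    "2022": [44.052, 68.081],
    "2023": [44.254, 68.688],
    "2024": [nan, nan],
    "Unnamed: 69": [nan, nan],
})

# Drop columns that pandas named automatically because the header was blank.
wide = wide.loc[:, ~wide.columns.str.startswith("Unnamed")]

# Reshape to one row per country and year, dropping years with no value.
id_cols = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]
long = wide.melt(id_vars=id_cols, var_name="year", value_name="value")
long["year"] = long["year"].astype(int)
long = long.dropna(subset=["value"]).reset_index(drop=True)

print(long[["Country Name", "year", "value"]])
```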


In [50]:
## Save updated csv to repo folder

df3.to_csv('Project/datasci207-final-project/Data/wb_urban.csv')

### World Bank Indicators: internet users

In [51]:
## Read csv, skipping the extra rows above the header

df4 = pd.read_csv('Project/wb_internet.csv', skiprows=3)
df4.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
0,Aruba,ABW,Individuals using the Internet (% of population),IT.NET.USER.ZS,,,,,,,...,93.5,97.2,,,,,,,,
1,Africa Eastern and Southern,AFE,Individuals using the Internet (% of population),IT.NET.USER.ZS,,,,,,,...,,,,,,,,,,
2,Afghanistan,AFG,Individuals using the Internet (% of population),IT.NET.USER.ZS,,,,,,,...,11.0,13.5,16.8,17.6,17.0,16.5,17.2,17.7,,
3,Africa Western and Central,AFW,Individuals using the Internet (% of population),IT.NET.USER.ZS,,,,,,,...,,,,,,,,,,
4,Angola,AGO,Individuals using the Internet (% of population),IT.NET.USER.ZS,,,,,,,...,23.2,26.0,29.0,32.1,36.6,39.4,42.1,44.8,,
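
The internet-use indicator has many gaps: some countries stop reporting before the most recent year (Aruba above has no value after 2017), and aggregate regions have none at all. If a single value per country is needed, one option is to take the most recent non-missing year. The sketch below illustrates this on a few values from the preview; the column names `latest_value` and `latest_year` are assumptions.

```python
import pandas as pd

nan = float("nan")

# A few rows and years taken from the preview above.
wide = pd.DataFrame({
    "Country Name": ["Aruba", "Afghanistan", "Africa Eastern and Southern"],
    "2016": [93.5, 11.0, nan],
    "2017": [97.2, 13.5, nan],
    "2023": [nan, 17.7, nan],
    "2024": [nan, nan, nan],
})

year_cols = [col for col in wide.columns if col.isdigit()]

# Carry the last observed value forward across years, then read the final column.
latest_value = wide[year_cols].ffill(axis=1).iloc[:, -1]

# Record which year that value came from (None when a row has no data at all).
latest_year = wide[year_cols].apply(lambda row: row.last_valid_index(), axis=1)

summary = pd.DataFrame({
    "Country Name": wide["Country Name"],
    "latest_value": latest_value,
    "latest_year": latest_year,
})
print(summary)
```

Keeping `latest_year` alongside the value makes it visible when a figure is several years old.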


In [52]:
## Save updated csv to repo folder

df4.to_csv('Project/datasci207-final-project/Data/wb_internet.csv')
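
The overview describes compiling these sources into one dataset, with country-level indicators attached to each language. A rough sketch of one way to do that is shown below: split each language's `country` field into one row per country, join the indicator on country name, then average back to one row per language. The indicator values here are made-up placeholders, and the column name `urban_pct` and the use of a simple mean are assumptions.

```python
import pandas as pd

languages = pd.DataFrame({
    "official_name": ["!Xun", "Aasáx"],
    "country": ["South Africa;Namibia;Angola;", "Tanzania;"],
})

# Placeholder indicator values, for illustration only (not real World Bank figures).
indicators = pd.DataFrame({
    "Country Name": ["South Africa", "Namibia", "Angola", "Tanzania"],
    "urban_pct": [60.0, 50.0, 70.0, 40.0],
})

# One row per (language, country) pair.
exploded = languages.assign(
    country=languages["country"].str.strip(";").str.split(";")
).explode("country")
exploded["country"] = exploded["country"].str.strip()

# A left join keeps languages whose country names fail to match, as NaN.
merged = exploded.merge(
    indicators, left_on="country", right_on="Country Name", how="left"
)

# Collapse back to one row per language.
per_language = merged.groupby("official_name", as_index=False)["urban_pct"].mean()
print(per_language)
```

Country names may differ between sources (the endangered languages data uses `USA`, for example), so unmatched rows after the join are worth inspecting before aggregating.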

### Scraping country and language text with BeautifulSoup

In [None]:
import requests
from bs4 import BeautifulSoup

# Load the HTML page, either from a URL or from a saved file (placeholder path below)
# html = requests.get('URL').text
with open('your_file.html', encoding='utf-8') as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')

# Collect candidate headers (country names are likely <h3>, <h2>, etc., depending on the site)
headers = soup.find_all(['h1', 'h2', 'h3', 'strong', 'b'])

# For each candidate header, print the text of the first following sibling
# that is not a 'major-language' label
for header in headers:
    country = header.get_text().strip()

    # Advance past bare strings and siblings whose text contains 'major-language'
    sibling = header.find_next_sibling()
    while sibling and (sibling.name is None or 'major-language' in sibling.get_text().lower()):
        sibling = sibling.find_next_sibling()

    if sibling:
        lang_text = sibling.get_text().strip()
        print(f"Country: {country}")
        print(f"Languages: {lang_text}")
        print()
