# Take a first look at the data
________

The first thing we'll need to do is load in the libraries and datasets we'll be using. For today, we'll be using a dataset of landslides courtesy of NASA.

In [1]:
# modules we'll use
import pandas as pd
import numpy as np
import glob
import getorg
from geopy import Nominatim

# read in all our data
landslide_data = pd.read_csv("GLC03122015.csv")

# set seed for reproducibility
np.random.seed(0)

Iywidgets and ipyleaflet support disabled. You must be in a Jupyter notebook to use this feature.
Error raised:
No module named 'ipyleaflet'
Check that you have enabled ipyleaflet in Jupyter with:
    jupyter nbextension enable --py ipyleaflet


The first thing I do when I get a new dataset is take a look at some of it. This lets me see that it all read in correctly and get an idea of what's going on with the data. In this case, I'm looking for missing values, which will be represented with `NaN` or `None`.

In [2]:
# look at a few rows of the landslide_data file. I can see a handful of missing data already!
landslide_data.sample(5)

Unnamed: 0,the_geom,OBJECTID,id,date_,time_,country,nearest_pl,hazard_typ,landslide_,trigger,...,population,countrycod,continentc,key_,version,user_id,tstamp,changeset_,latitude,longitude
1675,POINT (78.5925000000001 30.76520000000008),1673,3750,07/09/2011 07:00:00 AM +0000,,India,Gangotri,landslide,Landslide,Downpour,...,17123,,AS,IN,1,1,Tue Apr 01 2014 00:00:00 GMT+0000 (UTC),1,30.7652,78.5925
2836,POINT (-75.30810000000004 6.504600000000045),2837,6670,11/14/2014 08:00:00 AM +0000,,,La Montera,landslide,Landslide,Unknown,...,16707,,SA,CO,1,7,Tue Jan 13 2015 23:28:35 GMT+0000 (UTC),3631908793,6.5046,-75.3081
3641,POINT (77.1500000000001 28.758000000000063),3642,4875,05/26/2013 07:00:00 AM +0000,,India,Â Jammu-Srinagar,landslide,Landslide,Downpour,...,20736,,AS,IN,1,1,Tue Apr 01 2014 00:00:00 GMT+0000 (UTC),1,28.758,77.15
6226,POINT (74.8760000000001 32.73700000000006),6229,3908,08/11/2011 07:00:00 AM +0000,12/30/1899 08:00:00 AM +0000,India,On circular road near Panjtirthi,landslide,Landslide,Rain,...,465567,,AS,IN,1,1,Tue Apr 01 2014 00:00:00 GMT+0000 (UTC),1,32.737,74.876
3548,POINT (-74.18990000000004 4.617600000000043),3548,4089,12/08/2011 08:00:00 AM +0000,,Colombia,"Bosa, Bogota",landslide,Mudslide,Downpour,...,313945,,SA,CO,1,1,Tue Apr 01 2014 00:00:00 GMT+0000 (UTC),1,4.6176,-74.1899
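
In the sample above, missing entries show up as empty cells. As a minimal sketch on a made-up frame (the values below are hypothetical stand-ins, not rows of the landslide data), this is how pandas treats both `NaN` and `None` as missing:

```python
import numpy as np
import pandas as pd

# toy frame: one NaN in a numeric column, one None in a text column
toy = pd.DataFrame({
    "population": [17123, np.nan, 20736],
    "country": ["India", None, "Colombia"],
})

# isnull() flags both NaN and None as missing
missing_mask = toy.isnull()
print(missing_mask)

# number of missing cells per column
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

The same `isnull()` call works unchanged on `landslide_data`, whichever of the two markers a column happens to use.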


# See how many missing data points we have
___

Ok, now we know that we do have some missing values. Let's see how many we have in each column. 

In [3]:
# get the number of missing data points per column
missing_values_count = landslide_data.isnull().sum()

# look at the # of missing points in each column
missing_values_count

the_geom         0
OBJECTID         0
id               0
date_            4
time_         5519
country        608
nearest_pl      64
hazard_typ       0
landslide_       0
trigger          0
storm_name    6323
fatalities       0
injuries         0
source_nam    4293
source_lin     505
location_a       0
landslide1       2
photos_lin    6499
cat_src          1
cat_id           0
countrynam       0
near             1
distance         0
adminname1      71
adminname2    2994
population       0
countrycod    6788
continentc    2171
key_             2
version          0
user_id          0
tstamp           0
changeset_       0
latitude         0
longitude        0
dtype: int64

That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing to give us a better sense of the scale of this problem:

In [4]:
# how many total missing values do we have?
total_cells = np.prod(landslide_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

15.087549457025002

Wow, 15% of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.
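
The overall figure hides how unevenly the gaps are spread across columns. The following is a small sketch, on a hypothetical toy frame standing in for `landslide_data`, of one way to get the percentage of missing cells per column; it relies on the fact that the mean of a boolean mask is the fraction of `True` values:

```python
import numpy as np
import pandas as pd

# toy stand-in for landslide_data (hypothetical values)
toy = pd.DataFrame({
    "country": ["India", None, "Colombia", None],
    "time_": [None, None, None, "08:00"],
    "fatalities": [0, 2, 1, 0],
})

# percentage of missing cells in each column
percent_missing = toy.isnull().mean() * 100

# worst columns first
percent_missing = percent_missing.sort_values(ascending=False)
print(percent_missing)

# overall percentage, equivalent to total_missing / total_cells * 100
overall_percent = toy.isnull().sum().sum() / toy.size * 100
print(overall_percent)
```

Sorting makes it easy to spot columns that are almost entirely empty and may deserve different treatment from columns with only a few gaps.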

# Figure out why the data is missing
____
 
This is the point at which we get into the part of data science that I like to call "data intuition", by which I mean "really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis". It can be a frustrating part of data science, especially if you're newer to the field and don't have a lot of experience. For dealing with missing values, you'll need to use your intuition to figure out why the value is missing. One of the most important questions you can ask yourself to help figure this out is this:

> **Is this value missing because it wasn't recorded or because it doesn't exist?**

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children) then it doesn't make sense to try and guess what it might be. These values you probably do want to keep as `NaN`. On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row.
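
To make the distinction concrete, here is a hedged sketch on an invented survey table (all column names and values are hypothetical). Gaps that "don't exist" are left as `NaN`, while gaps that were "not recorded" are filled with the column median, which is only one simple choice of guess among many:

```python
import numpy as np
import pandas as pd

# toy survey (hypothetical values)
people = pd.DataFrame({
    "n_children": [0, 2, 1, 0],
    # height of the oldest child in cm; NaN for people without children
    "oldest_child_height": [np.nan, 132.0, np.nan, np.nan],
    # respondent's own height in cm; one value was simply not recorded
    "height": [170.0, np.nan, 182.0, 164.0],
})

# the value "doesn't exist" when the person has no children: keep it as NaN
does_not_exist = people["oldest_child_height"].isnull() & (people["n_children"] == 0)

# the remaining gaps were "not recorded", so a guess is reasonable
not_recorded = people["oldest_child_height"].isnull() & ~does_not_exist
people.loc[not_recorded, "oldest_child_height"] = people["oldest_child_height"].median()
people["height"] = people["height"].fillna(people["height"].median())

print(people)
```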

Let's work through an example. Looking at the number of missing values in the `landslide_data` dataframe, I notice that the column `country` has a lot of missing values in it:

In [5]:
# look at the # of missing points in the 'country' column
missing_values_count['country']

608

By looking at the documentation, I can see that this column has information on the location where the landslide occurred. This means that these values are probably missing because they were not recorded, rather than because they don't exist. So, it would make sense for us to try and guess what they should be rather than just leaving them as `NaN`s. We can try inferring those values from the `nearest_pl` column.

In [6]:
# get all the unique values in the 'nearest_pl' column

nearest_pl = landslide_data['nearest_pl'].unique()
for element in nearest_pl:
    print(element)

Grove Street from Anderson Avenue to Hine Hill Road, New Milford, CT
Borneo, Muara
Ocean Falls, B.C.
road to Holberg, 3 km from hwy 19, Vancouver Island, BC
Rennell Sound Road
main road in Port Alice and Neucel Pulp Mill, Rumble mountain, Vancouver Island, BC
Wrangell-St. Elias National Preserve, Chisana, Ak
Fort McNeill, Vancover Island, British Colombia
Kingcome Inlet, ON
four villages in Lempake Jaya, North Samarinda, East Kalimantan Province
Petir village, Darmaga sub district, Bogor district, West Java
Teslin, Yukon territory ( between Junction 37 and Teslin)
Majo Kampung BarijeÂ , North Sumatra, Nias
Quellouno
Mt La Perouse Glacier Bay National Park and Preserve, Alaska 99826
Daniel'S Harbour, Nl
Western Newfoundland town, Daniel's Harbour
Hmawzizar Jade Mine in Phakant, Pha-Kant in Kachin state
Hpakant
road between Hsaidaung(?) and Hpakant, Kachin
Hpakan, Kachin
Redditt, Ontario
Sakhan Thit village in Kyun Su township
Mount Haast
Mt. Haast
Mt Haast
Auraki-Mount Cook National Par

Bishkek-Osh road, Jalal-Abad region
US 160, La Veta Pass
State Highway 1 south of Big Sur, CA
Manali Kaza-Manali Road near Kunzum Pass, Spiti Valley, Himachal Pradesh
Highway 1, 17 mi north of Monterey County and San Luis Obispo County line, CA
Bungadobhan, Baglung, Western Region
Ishok Chingphu(?), Bishnupur, Assam
SR 112 on Juan de Fuca Scenic Byway
Gircha and Sarteez(Sartiz) villages, Gojal Tehsil, gilgit-baltistan
Sigou Village, Loufan County, Shanxi Province
port in Fangchenggang City, Guangxi, Zhuang Autonomous Region
Thukima-6, Taplejung
Shut down Ghurmi-Okhaldunga, Okhaldhunga-Rumjatar, Okhaldunga-Solu in Okhaldhunga
Okhaldhunga district
Gaurikund, road to Kedarnath, Garwhal division, Uttarakhand
Nungba, Tamenglong, Manipur
Fall Creekâ, Boise National Forest, Idaho
Udayapur district
Highway 31 near Woodbury Creek
flooding and landslides across country
Imphal-Jiribam section of NH-37 , Manipur
Mumra Village, Kalikot District, Western Region
Bobang Vdc, Baglung, Western Region


Gadgaron village of Matnog town, Sorsogon
Route 7 Between Alder And La Grande, Eatonville, Wa
3370 Glenwood Dr., Scott's Valley, CA
Granby landfill
Huanu Huanuni, Bella Vista, south of La Paz
Hunan Province, (Large spatial extent)
Ormonli village of Sirvan town of Siirt province
Barangay Sagrada in Viga town
Belandingan village, Bangli district, Bali
Mto Wa Mbu, Arusha,
Guangyuan city, Sichuan Province
Luojiang river, Chengkou county, Chongqing municipality
1 Chapman'S Peak Drive, Capetown
Surigao Sur,  Agusan del Sur
Vanua Levu
Tempi Valley intersection
Opol
Timpanogos Cave National Monument, Utah 92, American Fork, Ut
Threlkeld, near Keswick
Bath, Route 112, New Hampshire
Toowoomba
road from Pinjore(Pinjaur), Haryana to Nalagarh, Himachal Pradesh
Singnakorn(Singhanakhon) district, Songkhla province
Bartung-Ramdi Section, Siddhartha Highway
Portland, the road from Buff Bay to White Hall
Taplin Freewill Baptist Church, Taplin, Logan County, WV
Longhai railway between Tongguan and Taiya

Newcastle Golf Club Road, between 136th Avenue Southeast and 155th Avenue Southeast, Newcastle, WA
Maxatawny-Greenwich, north of Kutztown, Berks county, PA
Railway line south of Kaikoura, New Zealand
Dosquebradas, Risaralda
Fairhope, Alabama
Manilla
Interstate 70, Palisade, Western Co
provincial roads in Itogon, Benguet
Leivi
Guandao township, Tongcheng county, Xianning city, Hubei
Montebonito(?), Marulanda municipality, Caldas
Ankang
Ruzapfu valley, Kikruma Village, Phek District, Nagaland
Manukau
Luxi County, Hunan Province
Columbia Parkway, near Kemper, TN
Harris Park Train Station, Sydney
Moraga Avenue, Piedmont CA
Manchioneal, Portland parish
Lenneng(Liyyeng, barangay of Kabugao)-Kabugao rd, Apayao province, CAR
Mahaplag, Leyte
Barangay Telim,Calatrava
West Busway between Ingram and Sheraton stations, near Berry Street Tunnel, Pittsburgh, PA
Sitio Acub(?), San Isidro village(?), Koronadal city, South Cotabato province
Tacloban City in nearby Leyte province
TaclobanÂ City, Leyte
Jo

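We see that the entries are rather diverse, but each one is a free-text place description that can be used as a query to a geocoding service, e.g., Google Maps or the Nominatim geocoder used below.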

In [7]:
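# geocode each place description and print the country Nominatim returns
# recent geopy versions require a user_agent; the name used here is arbitrary
geocoder = Nominatim(user_agent="landslide-missing-values")
for place in nearest_pl:
    # `place != np.nan` is always True, so test for missing entries with pd.isnull
    if pd.isnull(place):
        continue
    # query once per place and reuse the result instead of geocoding twice
    # 'English' is passed as-is; Nominatim normally takes a language code such as 'en'
    location = geocoder.geocode(place,
                                exactly_one=True,
                                addressdetails=True,
                                language='English',
                                timeout=5)
    if location is not None:
        print(location.raw['address']['country'])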

  


Brunei
Canada
Canada
Perú
Canada
မြန်မာ
မြန်မာ
Canada
New Zealand / Aotearoa
New Zealand / Aotearoa
New Zealand / Aotearoa
Fiji
India
India
Vanuatu
India
Viti
India
中国
Viti
Ísland
Brasil
Malaysia
中国
USA
မြန်မာ
USA
中国
USA
ایران
Canada
Australia
USA
India
United States of America
Argentina
New Zealand / Aotearoa
India
Nepal
Brasil
Fiji
Malaysia
New Zealand / Aotearoa
USA
Canada
Australia
USA
Canada
नेपाल
Canada
پاکستان
India
中国
New Zealand / Aotearoa
Canada
Portugal
Canada
USA
India
Malaysia
India
Canada
Colombia
Myanmar
New Zealand / Aotearoa
USA
United States of America
España
नेपाल
India
UK
नेपाल
India
中国
Nepal
India
Kenya
USA
Canada
नेपाल
Papua Niugini
Malaysia
India
India
नेपाल
नेपाल
नेपाल
New Zealand / Aotearoa
USA
India
Brasil
India
India
Brasil
India
Australia
NKRI
China 中国
中国
India
USA
Кыргызстан
USA
नेपाल
नेपाल
नेपाल
नेपाल
India
Eesti
NKRI
नेपाल
中国
Australia
NKRI
Australia
USA
India
India
افغانستان
Canada
USA
नेपाल
Malaysia
Canada
United Kingdom
Brasil
India
NKRI
Brasil
中国
नेपा

GeocoderQuotaExceeded: HTTP Error 429: Too Many Requests

All in all, we've learned how to extract countries from free-text descriptions of places :)
However, the free Nominatim service limits the number of requests (hence the `GeocoderQuotaExceeded` error above), so covering every place would require an API without such a restriction, which is probably not free.
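
One way to stretch a limited quota is to look up only the places that actually need it: rows where `country` is missing but `nearest_pl` is present, with each distinct place queried once and a pause between requests. The sketch below illustrates that idea under stated assumptions: `fill_missing_countries`, its `lookup` argument and the `fake_answers` mapping are all hypothetical names introduced here, and the stand-in lookup replaces the real geocoder so the example runs offline.

```python
import time
import pandas as pd

def fill_missing_countries(df, lookup, delay_seconds=1.0):
    """Fill missing `country` values using `lookup(place) -> country or None`.

    `lookup` is a hypothetical callable; in the notebook it could wrap
    geocoder.geocode(...). Each distinct place is looked up only once.
    """
    result = df.copy()
    # only rows with no country but with a place description need a lookup
    needs_fill = result["country"].isnull() & result["nearest_pl"].notnull()
    cache = {}
    for place in result.loc[needs_fill, "nearest_pl"].unique():
        cache[place] = lookup(place)
        time.sleep(delay_seconds)  # pause between requests to stay under rate limits
    # map() returns the cached country, or a missing value when nothing was found
    result.loc[needs_fill, "country"] = result.loc[needs_fill, "nearest_pl"].map(cache)
    return result

# stand-in lookup so the sketch runs offline (made-up mapping)
fake_answers = {"La Montera": "Colombia"}
toy = pd.DataFrame({
    "country": ["India", None, None, None],
    "nearest_pl": ["Gangotri", "La Montera", "La Montera", None],
})
filled = fill_missing_countries(toy, fake_answers.get, delay_seconds=0)
print(filled)
```

With a real geocoder, the delay would need to match the service's usage policy. geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that is intended for this kind of throttling and could replace the manual `time.sleep` call.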