## The notebook is working with the Global Landslide Catalog dataset
### Here we explore the dataset and finding the columns that are not important and can be dropped off

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import io

import warnings

In [6]:
data = pd.read_csv('rows.csv')

In [14]:
data.columns

Index(['the_geom', 'OBJECTID', 'id', 'date_', 'time_', 'country', 'nearest_pl',
       'hazard_typ', 'landslide_', 'trigger', 'storm_name', 'fatalities',
       'injuries', 'source_nam', 'source_lin', 'location_a', 'landslide1',
       'photos_lin', 'cat_src', 'cat_id', 'countrynam', 'near', 'distance',
       'adminname1', 'adminname2', 'population', 'countrycod', 'continentc',
       'key_', 'version', 'user_id', 'tstamp', 'changeset_', 'latitude',
       'longitude'],
      dtype='object')

In [131]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6788 entries, 0 to 6787
Data columns (total 27 columns):
id            6788 non-null int64
date_         6784 non-null object
country       6180 non-null object
nearest_pl    6724 non-null object
hazard_typ    6788 non-null object
landslide_    6788 non-null object
trigger       6788 non-null object
fatalities    6788 non-null int64
injuries      6788 non-null int64
source_nam    2495 non-null object
source_lin    6283 non-null object
location_a    6788 non-null object
landslide1    6786 non-null object
photos_lin    289 non-null object
cat_src       6787 non-null object
cat_id        6788 non-null int64
countrynam    6788 non-null object
near          6787 non-null object
distance      6788 non-null float64
adminname1    6717 non-null object
adminname2    3794 non-null object
population    6788 non-null int64
countrycod    0 non-null float64
continentc    4617 non-null object
key_          6786 non-null object
latitude      6788 non-nu

In [15]:
data = data.drop(columns=['time_', 'storm_name', 'OBJECTID', 'the_geom'])

In [18]:
data = data.drop(columns=['tstamp','changeset_','user_id', 'version'])

In [21]:
data.hazard_typ = 'Landslide'

In [22]:
data.hazard_typ.unique()

array(['Landslide'], dtype=object)

### Cleaned  the data from useless columns and fixed `Landslide` typo

In [25]:
data[['country', 'nearest_pl','location_a', 'cat_src', 'cat_id', 'countrynam', 'near', 'adminname1', 'adminname2']]

Unnamed: 0,country,nearest_pl,location_a,cat_src,cat_id,countrynam,near,adminname1,adminname2
0,United States,Grove Street from Anderson Avenue to Hine Hill...,Known_within_1_km,glc,3177,United States,New Milford,Connecticut,Litchfield County
1,Indonesia,"Borneo, Muara",Unknown,glc,490,Indonesia,Longnawang,North Kalimantan,
2,,"Ocean Falls, B.C.",Known_within_1km,glc,6760,Canada,Kitimat,British Columbia,obe
3,Canada,"road to Holberg, 3 km from hwy 19, Vancouver I...",Known_within_1_km,glc,2494,Canada,Campbell River,British Columbia,
4,,Rennell Sound Road,Known_within_15km,glc,6415,Canada,Prince Rupert,British Columbia,obe
5,Canada,"main road in Port Alice and Neucel Pulp Mill, ...",Known_within_5_km,glc,2493,Canada,Campbell River,British Columbia,
6,United States,"Wrangell-St. Elias National Preserve, Chisana, Ak",Known_within_5_km,test,5194,United States,Tok,Alaska,obe
7,Canada,"Fort McNeill, Vancover Island, British Colombia",Known_within_5_km,glc,4066,Canada,Campbell River,British Columbia,
8,Canada,"Kingcome Inlet, ON",Known_within_25_km,glc,2505,Canada,Campbell River,British Columbia,
9,Indonesia,"four villages in Lempake Jaya, North Samarinda...",Known_within_25_km,glc,945,Indonesia,Sungaiboh,North Kalimantan,


In [26]:
data.columns

Index(['id', 'date_', 'country', 'nearest_pl', 'hazard_typ', 'landslide_',
       'trigger', 'fatalities', 'injuries', 'source_nam', 'source_lin',
       'location_a', 'landslide1', 'photos_lin', 'cat_src', 'cat_id',
       'countrynam', 'near', 'distance', 'adminname1', 'adminname2',
       'population', 'countrycod', 'continentc', 'key_', 'latitude',
       'longitude'],
      dtype='object')

### Here I want to determine the source of news based on the website provided

The idea is to create a function that takes a large url (e.g. http://www.newstimes.com/local/article/New-Milford-Kent-hit-hard-by-flooding-1046004.php) as an argument and returns the name or description of the source, such as `The New York Times`

#### Examples of URLs

In [92]:
for ws in data.source_lin[:10]:
    print(ws)

http://www.newstimes.com/local/article/New-Milford-Kent-hit-hard-by-flooding-1046004.php
http://www.brunei-online.com/bb/tue/apr1h10.htm
http://globalnews.ca/news/1818913/mudslide-splits-town-of-ocean-falls-in-half/
http://www.theprovince.com/news/State+emergency+Port+Hardy/3581434/story.html
http://www.cftktv.com/News/Story.aspx?ID=2164634
http://www.vancouversun.com/news/Rains+cause+mudslide+power+outages+Vancouver+Island/3579800/story.html
http://www.alaskadispatch.com/article/20130914/summer-heatwave-may-have-triggered-landslide-lonely-alaska-glacier
http://www.globaltvbc.com/mudslide+closes+section+of+highway+19+near+port+mcneill/6442530653/story.html
http://www.newswire.ca/en/releases/archive/September2010/27/c5513.html
http://www.thejakartaglobe.com/news/article/4210.html


### The approach is to parse the url and extract the domain name, then make a YAHOO search request with a keyword that is equals to a domain name of the website and extract the title of the 1st result of the search engine

In [118]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import time

def get_title(word):
    url = "http://search.yahoo.com/search?p="
    r = requests.get(url + word) 
    time.sleep(0.2)
    soup = BeautifulSoup(r.text, 'lxml')
    links = soup.find_all(attrs={"class": "title"})
    if len(links) > 0:
        if links[0].text == 'Including results for ':
            return links[1].text
        else:
            return links[0].text

def get_website(url):

    parsed_uri = urlparse(url)
    result = '{uri.netloc}'.format(uri=parsed_uri)
    #result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    return result


### Checking an approach... printing out the source name and the website name

In [116]:
for index, row in data.iterrows():
    if pd.isnull(row['source_nam']) and not pd.isnull(row['source_lin']):
        print(get_title(get_website(row['source_lin'])), get_website(row['source_lin']))
        row['source_nam'] = get_title(get_website(row['source_lin']))

NewsTimes: Greater Danbury Area News, Fairfield County News ... www.newstimes.com
Brunei Online Store - 1 Stop E-Commerce Store in Brunei .... www.brunei-online.com
The Province www.theprovince.com
Vancouver Sun www.vancouversun.com
Global BC www.globaltvbc.com
Newswire www.newswire.ca
The Jakarta Globe www.thejakartaglobe.com
The Jakarta Post www.thejakartapost.com
KTVA www.ktva.com
news.com.au www.news.com.au
Reuters www.reuters.com
The Philippine Star www.philstar.com
GNS Science www.gns.cri.nz
GNS Science www.gns.cri.nz
ThreeNow www.tv3.co.nz
Radio Fiji Two - FM 105.2 - Suva - Listen Online www.radiofiji.com.fj
The Jakarta Post www.thejakartapost.com
SIKKIM.COM isikkim.com
The Star thestar.com.my
Zee News zeenews.india.com
AlterNet www.alertnet.org
Radio Fiji Two - FM 105.2 - Suva - Listen Online www.radiofiji.com.fj
News18.com: CNN-News18 Breaking News India, Latest News ... ibnlive.in.com
HimVani - The Voice of Himachal www.himvani.com
The Advertiser | Latest Adelaide and South A

Channel 12 KTRV FOX | Nampa | Media | Services www.fox12idaho.com
गृहपृष्ठ - Gorkhapatra www.gorkhapatra.org.np
ekantipur www.kantipuronline.com
anhuitoday.com english.anhuinews.com
China Daily www.chinadaily.com.cn
Latin American Herald Tribune www.laht.com
Colombia News TV | RCN Networks www.colombianews.tv
Yakima Herald-Republic www.yakima-herald.com
CBS Denver denver.cbslocal.com
Taiwan News Online － Breaking News, Politics, Environment ... www.etaiwannews.com
The Tribune www.tribuneindia.com
The Hindu www.hindu.com
The Jakarta Post www.thejakartapost.com
BBC News news.bbc.co.uk
ekantipur www.kantipuronline.com
ReliefWeb www.reliefweb.int
The Mercury www.themercury.com.au
Australian Broadcasting Corporation www.abc.net.au
Nepalnews : Nepal's first online news portal www.nepalnews.com
The Himalayan Times www.thehimalayantimes.com
ReliefWeb www.reliefweb.int
HeraldNet.com - Everett and Snohomish County news from The Herald www.heraldnet.com
Iceland Review www.icelandreview.com
Daily 

IndexError: list index out of range

### Checking the approach on the other part of the data

In [133]:
for index, row in data.iterrows():
    if pd.isnull(row['source_nam']) and not pd.isnull(row['source_lin']):
        print(get_title(get_website(row['source_lin'])), get_website(row['source_lin']))
        row['source_nam'] = get_title(get_website(row['source_lin']))

NewsTimes: Greater Danbury Area News, Fairfield County News ... www.newstimes.com
Brunei Online Store - 1 Stop E-Commerce Store in Brunei .... www.brunei-online.com
The Province www.theprovince.com
Vancouver Sun www.vancouversun.com
Global BC www.globaltvbc.com
Newswire www.newswire.ca
The Jakarta Globe www.thejakartaglobe.com
The Jakarta Post www.thejakartapost.com
KTVA www.ktva.com
news.com.au www.news.com.au
Reuters www.reuters.com
The Philippine Star www.philstar.com
GNS Science www.gns.cri.nz
GNS Science www.gns.cri.nz
ThreeNow www.tv3.co.nz
Radio Fiji Two - FM 105.2 - Suva - Listen Online www.radiofiji.com.fj
The Jakarta Post www.thejakartapost.com
SIKKIM.COM isikkim.com
The Star thestar.com.my
Zee News zeenews.india.com
AlterNet www.alertnet.org
Radio Fiji Two - FM 105.2 - Suva - Listen Online www.radiofiji.com.fj
News18.com: CNN-News18 Breaking News India, Latest News ... ibnlive.in.com
HimVani - The Voice of Himachal www.himvani.com
The Advertiser | Latest Adelaide and South A

ekantipur www.kantipuronline.com
Channel 12 KTRV FOX | Nampa | Media | Services www.fox12idaho.com
गृहपृष्ठ - Gorkhapatra www.gorkhapatra.org.np
ekantipur www.kantipuronline.com
anhuitoday.com english.anhuinews.com
China Daily www.chinadaily.com.cn
Latin American Herald Tribune www.laht.com
Colombia News TV | RCN Networks www.colombianews.tv
Yakima Herald-Republic www.yakima-herald.com
CBS Denver denver.cbslocal.com
Taiwan News Online － Breaking News, Politics, Environment ... www.etaiwannews.com
The Tribune www.tribuneindia.com
The Hindu www.hindu.com
The Jakarta Post www.thejakartapost.com
BBC News news.bbc.co.uk
ekantipur www.kantipuronline.com
ReliefWeb www.reliefweb.int
The Mercury www.themercury.com.au
Australian Broadcasting Corporation www.abc.net.au
Nepalnews : Nepal's first online news portal www.nepalnews.com
The Himalayan Times www.thehimalayantimes.com
ReliefWeb www.reliefweb.int
HeraldNet.com - Everett and Snohomish County news from The Herald www.heraldnet.com
Iceland Re

KeyboardInterrupt: 

#### Comment: 
After 500 requests to YAHOO service their server bans my ip due to "too many requests"

## Summary:
The approach works well and can be scaled to other data sets