# Gathering Data  
---
### Pulling articles from [reliefweb.int](https://reliefweb.int/)
The search term "Punjab flood" was used, with the advanced-search filter `primary country` set to "India".

ReliefWeb calls itself the "leading online source for reliable and timely humanitarian information on global crises and disasters since 1996." ReliefWeb is a digital service of the United Nations Office for the Coordination of Humanitarian Affairs (OCHA). It monitors and gathers information from 4,000 global sources, which include international papers, government statements, social media posts, and humanitarian volunteer organizations.

Reference used: [GeeksforGeeks guide to GET and POST requests in Python](https://www.geeksforgeeks.org/get-post-requests-using-python/)

In [1]:
# Import libraries
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

The ReliefWeb search converter, found [here](https://reliefweb.github.io/search-converter/), was used to convert the search query "(punjab flood) AND primary_country.id:119" into the URL used for the request.

In [2]:
# url of search
url = "https://api.reliefweb.int/v1/reports?appname=apidoc&limit=1000&profile=list&preset=latest&slim=1&query[value]=(punjab%20flood)%20AND%20primary_country.id%3A119&query[operator]=AND"

res = requests.get(url)
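
The query string in the URL above can also be assembled from a dictionary rather than written by hand. The following is a minimal sketch using only the standard library; the names `BASE_URL`, `params`, and `built_url` are illustrative, and the parameter values are copied from the URL in this notebook rather than from the API documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint and parameter values copied from the request URL above.
BASE_URL = "https://api.reliefweb.int/v1/reports"

params = {
    "appname": "apidoc",
    "limit": 1000,
    "profile": "list",
    "preset": "latest",
    "slim": 1,
    "query[value]": "(punjab flood) AND primary_country.id:119",
    "query[operator]": "AND",
}

# urlencode escapes the spaces, brackets, parentheses, and colon.
built_url = f"{BASE_URL}?{urlencode(params)}"

# Decoding the query string recovers the original search expression.
decoded = parse_qs(urlsplit(built_url).query)
print(decoded["query[value]"][0])
```

Passing the same dictionary as `requests.get(BASE_URL, params=params)` should produce an equivalent request, since `requests` encodes the `params` argument itself.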

In [3]:
# check
res.status_code

200

In [4]:
# parse the JSON response body into Python objects
jsondata = res.json()
data = jsondata['data']

JSON (JavaScript Object Notation) is a text-based data format. The API returns its response as JSON text, and `res.json()` parses that text into Python dictionaries and lists, which are easy to index and loop over.
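
To make the structure concrete, here is a small sketch that parses a hand-written JSON string with the standard `json` module, which is roughly what `res.json()` does with the response body. The payload is hypothetical and trimmed down; only the `data` → `fields` → `url` nesting is taken from the code in this notebook.

```python
import json

# Hypothetical, trimmed-down payload; the real response has more keys per item.
raw_text = """
{
  "data": [
    {"id": "1", "fields": {"title": "Example report A", "url": "https://reliefweb.int/report/example-a"}},
    {"id": "2", "fields": {"title": "Example report B", "url": "https://reliefweb.int/report/example-b"}}
  ]
}
"""

parsed = json.loads(raw_text)          # JSON text -> Python objects
print(type(parsed).__name__)           # dict
print(type(parsed["data"]).__name__)   # list

# Same list comprehension pattern used for the real response.
sample_urls = [item["fields"]["url"] for item in parsed["data"]]
print(sample_urls)
```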

In [5]:
# collect the url of each report returned by the search
urls = [ item['fields']['url'] for item in data]

In [6]:
# check the number of urls (427 expected, one per report returned by the search)
len(urls)

427

In [7]:
# note: res here is still the JSON API response; soup is reassigned for each article page below
soup = BeautifulSoup(res.content, "lxml")

In [None]:
# pull each article page returned by the api search
# and keep only its title, date, and body
articles = []

for url in urls:
    # separate names so the api response (res) and its data list are not overwritten
    article = {}
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "lxml")

    article['title'] = soup.find("h1", {"class": "node-title clearfix"}).text
    article['date'] = soup.find("span", {"class": "date-display-single"}).text
    article['body'] = soup.find("div", {"class": "field body"}).text

    articles.append(article)
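
The scraping loop above sends one request per URL with no pause, and it stops at the first page where a request fails or an expected tag is missing. Below is a minimal sketch of a more defensive loop. The names `scrape_all`, `fetch`, and `delay_seconds` are hypothetical and not part of the original notebook, and `fake_fetch` is a stand-in for a function that would download and parse a real page.

```python
import time

def scrape_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls.

    Returns (results, failures): results holds whatever fetch returned,
    failures holds (url, error message) pairs for URLs that raised.
    """
    results, failures = [], []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause between requests
        try:
            results.append(fetch(url))
        except Exception as exc:  # record the problem and keep going
            failures.append((url, str(exc)))
    return results, failures


# Stand-in fetcher for demonstration only.
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("missing title tag")
    return {"url": url, "title": "Example title"}

results, failures = scrape_all(
    ["https://example.org/a", "https://example.org/broken", "https://example.org/c"],
    fake_fetch,
    delay_seconds=0,
)
print(len(results), len(failures))
```

Because each result can carry its own `url`, the resulting list stays aligned even when some pages fail, which a separate `urls` column would not guarantee.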

In [None]:
# create dataframe of articles pulled from above
articles_df = pd.DataFrame(articles)
# add the url connected with each article
articles_df['url'] = urls
# save the dataframe as a csv file
articles_df.to_csv("./reliefweb_floods.csv", index=False)

In [None]:
# check
articles_df.tail()

In [None]:
# check
articles_df.head()

In [None]:
# import used to convert the date strings into datetime objects
from datetime import datetime

In [None]:
# add a column with each article's publication date parsed as a datetime
articles_df['datetime'] = articles_df['date'].apply(lambda x: datetime.strptime(x, '%d %b %Y'))
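
As a quick illustration of the format string used above, this sketch parses a single date string with `datetime.strptime`. The sample value is hypothetical and assumes the scraped dates look like "02 Jul 2011" with no surrounding whitespace.

```python
from datetime import datetime

sample = "02 Jul 2011"  # hypothetical date string in the scraped format

# %d = day of month, %b = abbreviated month name, %Y = four-digit year
parsed_date = datetime.strptime(sample, "%d %b %Y")
print(parsed_date)  # 2011-07-02 00:00:00

# Parsed values compare chronologically, which the raw strings would not.
print(parsed_date >= datetime(2007, 1, 1))  # True
```

Note that `%b` matches month abbreviations in the current locale, so this assumes an English locale.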

In [None]:
# check
articles_df.head()

In [None]:
# test
datetime(2011, 7, 2, 0, 0)

In [None]:
# boolean mask for articles published on or after Jan 1, 2007
is_after_2006 = articles_df["datetime"] >= datetime(2007, 1, 1, 0, 0)

In [None]:
# dataframe of articles after 2006
articles_after06_df = articles_df.loc[is_after_2006, :]

In [None]:
# save the filtered dataframe as a csv file
articles_after06_df.to_csv("../data/reliefweb_floods_after06.csv", index=False)

**Summary** In this notebook, articles related to "punjab flood" were pulled from the reliefweb.int website, with the advanced-search filter `primary country` set to "India". The search returned 427 URLs, which were then scraped using BeautifulSoup to collect each article's title, publication date, and body text. The results were saved to a csv file named reliefweb_floods.csv. A second csv file, reliefweb_floods_after06.csv, was created excluding all articles published before January 1, 2007.