# Selfie Project

This project, for the Lede Program 2021, should demonstrate what we learned this week about web scraping, HTML, and JavaScript.

## Wikipedia page to scrape

There were no readily available datasets to download on selfie-related accidents, but I found a useful Wikipedia page that records many of them, with the most recent entries from 2021.

Source: [List of selfie-related injuries and deaths](https://en.wikipedia.org/wiki/List_of_selfie-related_injuries_and_deaths#cite_note-:1-5)

## Reading the web page into Python

In [7]:
# importing the necessary libraries 
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [4]:
# fetch web page from the URL and store the result in a "response" object called r
# response object has a text attribute, which contains the same HTML code from our web browser

r = requests.get('https://en.wikipedia.org/wiki/List_of_selfie-related_injuries_and_deaths')
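
Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch, assuming a hypothetical helper called `fetch_html` that is not part of the original notebook; the timeout value and the User-Agent string are illustrative assumptions as well.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the HTML text of `url`, raising an error if the request failed.

    `fetch_html`, the default timeout, and the User-Agent string are
    illustrative assumptions, not part of the original notebook.
    The `get` parameter only exists so the function can be tried offline.
    """
    headers = {"User-Agent": "selfie-project-scraper (educational use)"}
    response = get(url, headers=headers, timeout=timeout)
    # raise requests.HTTPError on 4xx/5xx instead of silently parsing an error page
    response.raise_for_status()
    return response.text
```

Under these assumptions, `fetch_html('https://en.wikipedia.org/wiki/List_of_selfie-related_injuries_and_deaths')` would return the same string as `r.text`.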

In [5]:
# print the first 500 characters of the HTML
print(r.text[0:500])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of selfie-related injuries and deaths - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"c32d0baa-5d63


### Parsing the HTML using Beautiful Soup

In [8]:
# parse the HTML (stored in r.text) into a special object called soup that the Beautiful Soup library understands
soup = BeautifulSoup(r.text, 'html.parser')

In [9]:
# finds the title tag
soup.title

<title>List of selfie-related injuries and deaths - Wikipedia</title>

In [12]:
# find how many table tags exist --> There are 2 tables on that page
len(soup.find_all('table'))

2
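
Picking a table by its position works here, but it may break if the page layout changes. As a hedged alternative, the sketch below selects tables by the `wikitable` class that the data table carries. The HTML string is a made-up stand-in for the real page, and the `infobox` table in it is purely illustrative.

```python
from bs4 import BeautifulSoup

# a tiny stand-in for the page: one unrelated table and one data table
sample_html = """
<table class="infobox"><tr><td>not the data</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Date</th><th>Country</th></tr>
  <tr><td>15 October 2011</td><td>United States</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any table whose class list contains "wikitable"
data_tables = sample_soup.find_all("table", class_="wikitable")
first_header = data_tables[0].find("th").get_text(strip=True)
print(len(data_tables), first_header)
```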

### Ask Beautiful Soup to find all of the records


In [21]:
# find where the data you want resides (in the table tag)
sp_table = soup.find_all('table')

In [24]:
# look at the 2 tables and find out which one has the needed data: table[0] 
sp_table[0]

<table class="wikitable sortable">
<tbody><tr>
<th scope="col">Date
</th>
<th scope="col">Country
</th>
<th scope="col">Injuries/Casualties
</th>
<th scope="col">Type
</th>
<th class="unsortable" scope="col">Description
</th>
<th class="unsortable" scope="col">Source(s)
</th></tr>
<tr>
<td><span data-sort-value="000000002011-10-15-0000" style="white-space:nowrap">15 October 2011</span>
</td>
<td>United States
</td>
<td>3
</td>
<td>Transport
</td>
<td>Three teenagers (two sisters and a friend) were killed by a train while posing for a selfie that was found on their phone. Shortly before, they posted the message "Standing right by a train ahaha this is awesome!!!!" to <a href="/wiki/Facebook" title="Facebook">Facebook</a>.
</td>
<td><sup class="reference" id="cite_ref-8"><a href="#cite_note-8">[8]</a></sup><sup class="reference" id="cite_ref-9"><a href="#cite_note-9">[9]</a></sup>
</td></tr>
<tr>
<td><span data-sort-value="000000002014-03-01-0000" style="white-space:nowrap">March 2014</s

In [25]:
# save it as the new soup
sp_table = sp_table[0]

In [28]:
# find_all tr (table rows)

sp_trs = sp_table.find_all('tr')

In [75]:
# separate the first tr tag row for the header
sp_th = sp_trs[0].find_all('th')
sp_header = []
for th in sp_th:
    sp_header.append(th.text)

In [76]:
sp_header

['Date\n',
 'Country\n',
 'Injuries/Casualties\n',
 'Type\n',
 'Description\n',
 'Source(s)\n']
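
Every header ends in `\n` because the page puts a line break before each closing `</th>`. One way to avoid this, sketched here on a shortened copy of the header row, is to call `get_text(strip=True)` instead of reading `.text`. The names `header_html`, `header_row`, and `clean_header` are illustrative only.

```python
from bs4 import BeautifulSoup

# shortened header row with the same line break before each </th>
header_html = """
<table><tr>
<th scope="col">Date
</th>
<th scope="col">Country
</th>
<th class="unsortable" scope="col">Source(s)
</th></tr></table>
"""

header_row = BeautifulSoup(header_html, "html.parser").find("tr")

# get_text(strip=True) removes leading and trailing whitespace, including newlines
clean_header = [th.get_text(strip=True) for th in header_row.find_all("th")]
print(clean_header)
```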

In [38]:
# take a look at the HTML structure for one row in the table
sp_trs[1] 

<tr>
<td><span data-sort-value="000000002011-10-15-0000" style="white-space:nowrap">15 October 2011</span>
</td>
<td>United States
</td>
<td>3
</td>
<td>Transport
</td>
<td>Three teenagers (two sisters and a friend) were killed by a train while posing for a selfie that was found on their phone. Shortly before, they posted the message "Standing right by a train ahaha this is awesome!!!!" to <a href="/wiki/Facebook" title="Facebook">Facebook</a>.
</td>
<td><sup class="reference" id="cite_ref-8"><a href="#cite_note-8">[8]</a></sup><sup class="reference" id="cite_ref-9"><a href="#cite_note-9">[9]</a></sup>
</td></tr>

In [113]:
# for each tr, find tds then for each td get text inside, then save to new array
sp_list = []
for tr in sp_trs[1:]:
    tds = tr.find_all('td')
    tr_list = []
    for i, td in enumerate(tds):
        # if it's the sixth column (Source(s)), keep the <sup> citation tags instead of the text
        if i == 5:
            tr_list.append(td.find_all('sup'))
        else:
            tr_list.append(td.text)
    sp_list.append(tr_list)

### Building the dataset

In [114]:
# applying a tabular data structure using pandas

sp_df = pd.DataFrame(sp_list, columns=sp_header)

In [115]:
sp_df.head(5)

| | Date\n | Country\n | Injuries/Casualties\n | Type\n | Description\n | Source(s)\n |
|---|---|---|---|---|---|---|
| 0 | 15 October 2011\n | United States\n | 3\n | Transport\n | Three teenagers (two sisters and a friend) wer... | [[[[8]]], [[[9]]]] |
| 1 | March 2014\n | Spain\n | 1\n | Electrocution\n | A 21-year-old man was electrocuted after climb... | [[[[10]]]] |
| 2 | March 2014\n | Russia\n | 1\n | Transport\n | A train driver saw two people near the train t... | [[[[11]]]] |
| 3 | April 2014\n | United States\n | 1\n | Transport\n | A 32-year-old woman from North Carolina was dr... | [[[[12]]], [[[13]]]] |
| 4 | 22 April 2014\n | Russia\n | 1\n | Fall\n | A 17-year-old girl fell 30 ft to her death aft... | [[[[14]]]] |


In [93]:
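# try to strip the newlines; without regex=True, replace() only matches cells
# that are exactly '\n', so the trailing '\n' is still present in the output below
sp_df = sp_df.replace('\n', '')
sp_df['Date\n']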

0       15 October 2011\n
1            March 2014\n
2            March 2014\n
3            April 2014\n
4         22 April 2014\n
              ...        
183     12 January 2020\n
184       30 April 2020\n
185    12 December 2020\n
186     12 January 2021\n
187         16 May 2021\n
Name: Date\n, Length: 188, dtype: object

In [48]:
sp_df.head(5)

| | Date\n | Country\n | Injuries/Casualties\n | Type\n | Description\n | Source(s)\n |
|---|---|---|---|---|---|---|
| 0 | 15 October 2011\n | United States\n | 3\n | Transport\n | Three teenagers (two sisters and a friend) wer... | |
| 1 | March 2014\n | Spain\n | 1\n | Electrocution\n | A 21-year-old man was electrocuted after climb... | |
| 2 | March 2014\n | Russia\n | 1\n | Transport\n | A train driver saw two people near the train t... | |
| 3 | April 2014\n | United States\n | 1\n | Transport\n | A 32-year-old woman from North Carolina was dr... | |
| 4 | 22 April 2014\n | Russia\n | 1\n | Fall\n | A 17-year-old girl fell 30 ft to her death aft... | |
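
The values and column names still end in `\n`. `DataFrame.replace('\n', '')` only replaces cells whose whole value is `'\n'`, unless `regex=True` is passed. A minimal sketch of one way to strip the whitespace follows, using a small stand-in frame called `demo_df` (an illustrative name, not the notebook's `sp_df`).

```python
import pandas as pd

# small stand-in with the same trailing newline problem as the scraped table
demo_df = pd.DataFrame(
    [["15 October 2011\n", "United States\n", "3\n"],
     ["March 2014\n", "Spain\n", "1\n"]],
    columns=["Date\n", "Country\n", "Injuries/Casualties\n"],
)

# strip whitespace from the column names
demo_df.columns = demo_df.columns.str.strip()

# strip whitespace from every string cell; non-string cells are left untouched
for col in demo_df.columns:
    demo_df[col] = demo_df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

print(demo_df["Date"].tolist())
```

The `isinstance` check is there because a column such as `Source(s)` may hold lists of tags rather than strings.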


## Find all the corresponding links
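
The `Source(s)` cells hold `<sup>` tags that each wrap an `<a>` element. As a hedged sketch of how the links could be collected, the example below pulls the `href` of every anchor in one cell, copied from the first data row shown earlier. The names `cell_html`, `td`, and `links` are illustrative.

```python
from bs4 import BeautifulSoup

# one Source(s) cell, copied from the first data row of the table
cell_html = (
    '<td><sup class="reference" id="cite_ref-8"><a href="#cite_note-8">[8]</a></sup>'
    '<sup class="reference" id="cite_ref-9"><a href="#cite_note-9">[9]</a></sup></td>'
)

td = BeautifulSoup(cell_html, "html.parser").find("td")

# collect the href attribute of every <a> tag inside the cell
links = [a["href"] for a in td.find_all("a", href=True)]
print(links)
```

These values are in-page anchors that point to the article's reference list, not external URLs, so finding the cited articles themselves would need a further lookup step.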

## Export the dataset to a CSV file

In [None]:
# use pandas to save it as a CSV file
sp_df.to_csv('scrapeddata.csv', index=False, encoding='utf-8')
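
A quick way to check an export is to read it back. The sketch below does this with an in-memory buffer instead of a file on disk; `demo_df` and its values are a small stand-in for illustration, not the scraped dataset.

```python
import io

import pandas as pd

demo_df = pd.DataFrame(
    {"Date": ["15 October 2011", "March 2014"],
     "Country": ["United States", "Spain"],
     "Injuries/Casualties": [3, 1]}
)

# write the CSV text to an in-memory buffer rather than to disk
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)

# rewind the buffer and read the CSV back into a new frame
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.shape)
```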