# <div align="center">Webscraping Project</div>

This is my first attempt at web scraping. For this project I did a quick search for a website with a simple table. I watched [this video](https://www.youtube.com/watch?v=ng2o98k983k) about 45 times and still went back to it whenever I got stuck. The first step is to import the necessary libraries: BeautifulSoup (from the bs4 package) and requests.

In [1]:
from bs4 import BeautifulSoup
import requests

Here we download the page's HTML with requests and save the raw response body as 'content'.

In [2]:
results = requests.get('https://www.guru99.com/python-csv.html')
content = results.content

Here we take that HTML and use the Beautiful Soup library to parse it into something usable. The parsed result can also be prettified to make it readable, but I'm not including the prettified version of the whole page because it's too bloody long. Note that the 'lxml' parser used here has to be installed separately.

In [3]:
parser = BeautifulSoup(content, 'lxml')
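
To give an idea of what prettifying does without dumping the whole page, here is a small sketch on a made-up snippet. The `sample_html` string is invented for illustration, and I use Python's built-in 'html.parser' here so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document so the prettified output stays short
sample_html = '<html><body><p>Hello</p></body></html>'

# 'html.parser' ships with Python; the notebook itself uses 'lxml'
sample_parser = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns a string with one tag per line, indented by nesting depth
pretty = sample_parser.prettify()
print(pretty)
```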


Using the HTML inspector tool in the Chrome browser, I identified that the table I had chosen uses the 'table' tag with the class 'table table-striped'. I saved its text and printed it to make sure that it was indeed the table I was looking to scrape.

In [4]:
table = parser.find('table', class_='table table-striped')
table_txt = table.text
print(table_txt)


Programming language Designed by Appeared Extension Python Guido van Rossum 1991 .py Java James Gosling 1995 .java C++ Bjarne Stroustrup 1983 .cpp 
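
Passing the whole string 'table table-striped' to `class_` matches the class attribute exactly as written. As far as I understand it, a CSS selector is a bit more forgiving, since it matches a table carrying both classes in any order. The sketch below compares the two on a cut-down copy of the table. The `sample_html` string is just a stand-in I typed up.

```python
from bs4 import BeautifulSoup

sample_html = '''
<table class="table table-striped">
  <tr><td>Programming language </td><td>Designed by </td></tr>
  <tr><td>Python </td><td>Guido van Rossum </td></tr>
</table>
'''
soup = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the full class attribute string, as in the notebook
by_class = soup.find('table', class_='table table-striped')

# CSS selector: any <table> that has both classes, in any order
by_selector = soup.select_one('table.table.table-striped')

# Both lookups should land on the same tag
print(by_class is by_selector)
```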


Here I prettified the table's HTML so that I could identify the tags that break out the rows and elements of this table.


In [5]:
mable = table.prettify()
print(mable)


<table class="table table-striped">
 <tr>
  <td>
   Programming language
  </td>
  <td>
   Designed by
  </td>
  <td>
   Appeared
  </td>
  <td>
   Extension
  </td>
 </tr>
 <tr>
  <td>
   Python
  </td>
  <td>
   Guido van Rossum
  </td>
  <td>
   1991
  </td>
  <td>
   .py
  </td>
 </tr>
 <tr>
  <td>
   Java
  </td>
  <td>
   James Gosling
  </td>
  <td>
   1995
  </td>
  <td>
   .java
  </td>
 </tr>
 <tr>
  <td>
   C++
  </td>
  <td>
   Bjarne Stroustrup
  </td>
  <td>
   1983
  </td>
  <td>
   .cpp
  </td>
 </tr>
</table>



Confirming that the 'tr' tag holds our rows, and printing the first one as text ('find' only returns the first match). One thing that makes this a little more complicated is that entries like 'Programming language' and 'Designed by' are made up of more than one word, so I can't simply split that text on spaces to get my list. I will need a different approach.

In [6]:
rows = table.find('tr')
trows = rows.text
print(trows)

Programming language Designed by Appeared Extension 
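
One possible way around the multi-word problem is to ask Beautiful Soup to put a separator between cells when it builds the text. This is only a sketch on a hand-copied header row, and the '|' separator is my own choice. It would break if a cell ever contained that character.

```python
from bs4 import BeautifulSoup

row_html = ('<tr><td>Programming language </td><td>Designed by </td>'
            '<td>Appeared </td><td>Extension </td></tr>')
header_row = BeautifulSoup(row_html, 'html.parser').find('tr')

# separator marks the boundary between cells; strip=True trims each cell's whitespace
joined = header_row.get_text(separator='|', strip=True)

# Splitting on the separator keeps multi-word cells in one piece
headers = joined.split('|')
print(headers)
```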


Another issue is that I want to use the 'find_all' method to pull out all the rows, but it returns a list of tags (a ResultSet), and I can't call '.text' on the list as a whole to clean it up.

In [7]:
rowses = table.find_all('tr')
print(rowses)

[<tr><td>Programming language </td><td>Designed by </td><td>Appeared </td><td>Extension </td></tr>, <tr><td>Python </td><td>Guido van Rossum </td><td>1991 </td><td>.py </td></tr>, <tr><td>Java </td><td>James Gosling </td><td>1995 </td><td>.java </td></tr>, <tr><td>C++ </td><td>Bjarne Stroustrup </td><td>1983 </td><td>.cpp </td></tr>]
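
The sketch below shows what I mean on a shortened, made-up table: asking the whole result for '.text' fails, but asking each row in turn works fine.

```python
from bs4 import BeautifulSoup

sample_html = ('<table><tr><td>Python </td><td>.py </td></tr>'
               '<tr><td>Java </td><td>.java </td></tr></table>')
all_rows = BeautifulSoup(sample_html, 'html.parser').find_all('tr')

try:
    all_rows.text  # a ResultSet is a list of tags, not a single tag
except AttributeError as err:
    print('No .text on the whole result:', err)

# Ask each row for its text instead
row_texts = [row.text for row in all_rows]
print(row_texts)
```

Each row still comes back as one run-together string, so this alone does not separate the cells.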


Another method to try later would be to use the 'td' tag to identify each element and build my list from there. I could maybe use a for loop to clean each entry and append the result to a new list. The problem there is that I lose my row organization. I would prefer to work with each row and end up with a list of lists. For now I just grab the first 'td' of the first row to see what it looks like.

In [8]:
entries = rowses[0].find('td')
shmintries = entries
print(shmintries)

<td>Programming language </td>
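
For the record, the flat 'td' approach could probably be rescued by chopping the flat list back into rows afterwards. This is a rough sketch on a shortened copy of the table, and it assumes every row has the same number of cells, which is not guaranteed on other pages.

```python
from bs4 import BeautifulSoup

sample_html = '''<table>
<tr><td>Programming language </td><td>Designed by </td><td>Appeared </td><td>Extension </td></tr>
<tr><td>Python </td><td>Guido van Rossum </td><td>1991 </td><td>.py </td></tr>
</table>'''
sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')

# One flat list of every cell in the table, in document order
cells = [td.get_text(strip=True) for td in sample_table.find_all('td')]

# Count the cells in the first row, then slice the flat list into rows of that width
n_cols = len(sample_table.find('tr').find_all('td'))
regrouped = [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]
print(regrouped)
```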


This is where the magic happens! I collect all the rows of the table that we identified earlier and then initialize an empty list that will be our final product. I used nested for loops to pull the text out of each item of each row and return the table as a list of lists like I was looking to do.

In [9]:
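# Every <tr> in the table is one row
stuff = table.find_all('tr')
new_list = []
for row in stuff:
    row_list = []
    # Look only at the <td> cells, so stray whitespace between tags is never picked up
    for element in row.find_all('td'):
        clean = element.text  # drop the tags, keep the text
        row_list.append(clean)
    new_list.append(row_list)
print(new_list)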

[['Programming language ', 'Designed by ', 'Appeared ', 'Extension '], ['Python ', 'Guido van Rossum ', '1991 ', '.py '], ['Java ', 'James Gosling ', '1995 ', '.java '], ['C++ ', 'Bjarne Stroustrup ', '1983 ', '.cpp ']]
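
The entries above still carry a trailing space. A more compact variant of the same idea, which also trims whitespace, could look like the sketch below. The sample table is made up, and I gave it 'th' header cells and line breaks between tags on purpose, since I assume many real pages look like that.

```python
from bs4 import BeautifulSoup

# Made-up sample with <th> header cells and whitespace between tags
sample_html = '''
<table>
  <tr><th>Programming language </th><th>Extension </th></tr>
  <tr><td>Python </td><td>.py </td></tr>
  <tr><td>Java </td><td>.java </td></tr>
</table>
'''
sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')

# Outer comprehension walks the rows, inner one walks the cells of each row.
# find_all accepts a list of tag names, so header and data cells are both picked up.
table_rows = [
    [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
    for row in sample_table.find_all('tr')
]
print(table_rows)
```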


This last part is pretty simple with the help of the pandas library. First, I split the list to have the header row saved separately to be used as the column names. Then pandas just assembles the dataframe and it prints out a pretty picture for us! 

In [10]:
import pandas as pd

data = new_list[1:]     # every row after the header
columns = new_list[0]   # the header row becomes the column names

flable = pd.DataFrame(data, columns=columns)
flable.head()

|   | Programming language | Designed by | Appeared | Extension |
|---|---|---|---|---|
| 0 | Python | Guido van Rossum | 1991 | .py |
| 1 | Java | James Gosling | 1995 | .java |
| 2 | C++ | Bjarne Stroustrup | 1983 | .cpp |
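
Everything scraped this way is still a string, including the years. As a possible next step, the sketch below trims the leftover spaces and turns the 'Appeared' column into integers. It starts from the list of lists printed earlier, and the `df` name is just my placeholder.

```python
import pandas as pd

# The list of lists produced by the nested loops, copied from the output above
new_list = [
    ['Programming language ', 'Designed by ', 'Appeared ', 'Extension '],
    ['Python ', 'Guido van Rossum ', '1991 ', '.py '],
    ['Java ', 'James Gosling ', '1995 ', '.java '],
    ['C++ ', 'Bjarne Stroustrup ', '1983 ', '.cpp '],
]

# Strip the trailing spaces from the header names before using them as columns
columns = [name.strip() for name in new_list[0]]
df = pd.DataFrame(new_list[1:], columns=columns)

# Every column holds strings at this point, so trim them all
df = df.apply(lambda col: col.str.strip())

# The year is the only numeric column in this table
df['Appeared'] = df['Appeared'].astype(int)
print(df.dtypes)
```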
