![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

### Phase 1: Web Scraping with Selenium

To retrieve information from the API, it is necessary to get the IDs of the supported SDKs listed on the API's website. The information is rendered inside 'react text' elements, so the most suitable way to get the data was the Python library Selenium. The Selenium Python bindings provide a simple API for writing functional/acceptance tests with Selenium WebDriver, and give convenient access to WebDrivers such as Firefox, IE, Chrome and Remote. The Python versions supported at the time of writing were 2.7 and 3.5 and above.

References:

[1] https://selenium-python.readthedocs.io/locating-elements.html

[2] https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72

In [None]:
'''
Before starting, it is necessary to add to PATH the directory where 'geckodriver.exe' is saved. In this
project the directory is added temporarily using: $ export PATH=$PATH:/home/potacho/apps/
The result can be checked using: $ echo $PATH
'''
#First step: 'webdriver' creation.
from selenium import webdriver
browser = webdriver.Firefox()
browser.implicitly_wait(10)
browser.get('https://42matters.com/docs/app-market-data/supported-sdks')
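
The PATH check described in the cell above can also be done from Python before creating the webdriver. This is a minimal sketch, not part of the original notebook: the helper `find_driver` is hypothetical, and the executable name `geckodriver` is an assumption (on Windows it would be `geckodriver.exe`).

```python
import shutil

def find_driver(name="geckodriver", search_path=None):
    """Return the full path of the driver executable, or None if it is not found.

    search_path=None means "use the current PATH environment variable".
    """
    return shutil.which(name, path=search_path)

driver_path = find_driver()
if driver_path is None:
    print("geckodriver not found on PATH; webdriver.Firefox() is likely to fail")
else:
    print("geckodriver found at: {}".format(driver_path))
```

Running this first gives a clearer message than the exception raised when the browser cannot be started.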

In [None]:
'''
The website shows different visualization layers under the same URL using 'react text'. With the proposed
method, we can only retrieve the elements that are displayed at the moment this piece of code runs.
Therefore, a print line is included to check that the total number of elements is correct. We also store
the 'total' variable for later steps.
'''
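#Second step: calculation of the total number of sdk elements by category.
#find_elements(By.XPATH, ...) also works in Selenium versions where find_elements_by_xpath was removed.
from selenium.webdriver.common.by import By
total_element = browser.find_elements(By.XPATH, '//*[@id="sdk-doc-app"]/div/div/div/div[1]')
total_string = total_element[0].text
#The first whitespace-separated token made only of digits is taken as the total.
total = [int(s) for s in total_string.split() if s.isdigit()][0]
print('Total amount of sdks to scrape: {}'.format(total))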
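
The digit-extraction step can be tried without a browser. The sketch below isolates it in a hypothetical helper, `extract_total`, and the header text `"Showing 128 SDKs"` is invented for illustration; the real text on the page may differ.

```python
def extract_total(header_text):
    """Return the first whitespace-separated token that is all digits, as an int."""
    for token in header_text.split():
        if token.isdigit():
            return int(token)
    # Unlike indexing an empty list, this reports what text was actually seen.
    raise ValueError("no integer found in: {!r}".format(header_text))

print(extract_total("Showing 128 SDKs"))  # hypothetical header text
```

Note that a token such as `1,024` or `(128)` is not all digits, so this approach only works when the number is surrounded by spaces.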

In [None]:
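def get_sdks_elements(t):
    '''Return, for each of the t sdk tiles, the list of elements matched by its XPath.'''
    from selenium.webdriver.common.by import By
    sdks_elements = []
    #XPath positions start at 1, so the range runs from 1 to t inclusive.
    for n in range(1, t+1):
        sdk = browser.find_elements(By.XPATH, '//*[@id="sdk-doc-app"]/div/div/div/div[2]/div/div[%s]' % n)
        sdks_elements.append(sdk)
    return sdks_elements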

In [None]:
elements = get_sdks_elements(total)
#Reuse the result instead of scraping the page a second time.
print('Total sdks elements scraped: {}'.format(len(elements)))

In [None]:
#Each tile's text is turned into one comma-separated string.
sdks_list = []
for tile in elements:
    sdks_list.append([x.text.replace('\nID: ',',').replace('\n',',') for x in tile])

#Flatten the list of lists into a single list of strings.
sdks = [item for sublist in sdks_list for item in sublist]

In [None]:
sdks[0]
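
To make the text clean-up easier to follow, here is the same pair of replacements applied to a sample string. The tile layout in `sample` (name, then an `ID:` line, then one or two categories) is an assumption inferred from the replacement strings and the column names used later; `tile_text_to_row` is a hypothetical helper name.

```python
def tile_text_to_row(text):
    """Turn the multi-line text of one sdk tile into a comma-separated string."""
    # First drop the 'ID: ' label, then turn the remaining line breaks into commas.
    return text.replace('\nID: ', ',').replace('\n', ',')

sample = 'Firebase\nID: firebase\nBackend\nDevelopment Tool'  # assumed tile text
print(tile_text_to_row(sample))
# Firebase,firebase,Backend,Development Tool
```

The order of the two replacements matters: replacing `'\n'` first would leave the `ID: ` label inside the second field.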

In [None]:
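#Function: .csv file generator. Arg1: list of sdks strings. Arg2: .csv file name.
def write_to_csv(list_of_sdks, file_name):
    import csv
    #newline='' lets the csv module control the line endings.
    with open(file_name, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        #One row per sdk: split each comma-separated string back into its fields.
        writer.writerows(sdk.split(',') for sdk in list_of_sdks)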

In [None]:
write_to_csv(sdks,'development_tool.csv')
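
The resulting file holds one SDK per line with comma-separated fields. As a hedged, in-memory sketch of that layout, the example below writes two sample rows to an `io.StringIO` buffer instead of a file and reads them back. Splitting each string into fields before writing is a variant chosen for this illustration, and the sample rows are taken from the dataframe shown further down.

```python
import csv
import io

rows = ['Firebase,firebase,Backend,Development Tool',
        'Amazon AWS,amazon-aws,Backend']

# Write one csv row per sdk string.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerows(row.split(',') for row in rows)

# Read the rows back to confirm the round trip.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed)
```

Rows may have three or four fields, because the second category is optional.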

In [2]:
def sdks_to_pandas(csv_file):
    import pandas as pd
    cols = ['sdk_name','sdk_id','sdk_cat1','sdk_cat2']
    dataframe = pd.read_csv(csv_file, sep=',', header=None, index_col=False, names = cols)
    return dataframe

In [3]:
df_backend = sdks_to_pandas('backend.csv')
display('Dataframe correctly generated')
display('Dataframe shape: {}'.format(df_backend.shape),df_backend.head())

'Dataframe correctly generated'

'Dataframe shape: (5, 4)'

| | sdk_name | sdk_id | sdk_cat1 | sdk_cat2 |
|---|---|---|---|---|
| 0 | Amazon AWS | amazon-aws | Backend | |
| 1 | Firebase | firebase | Backend | Development Tool |
| 2 | Parse | parse | Backend | Development Tool |
| 3 | Particle SDK | particle-sdk | Backend | |
| 4 | WNS SDK | wns-sdk | Backend | Development Tool |


In [5]:
sdks_list = list(df_backend.sdk_id)
sdks_url = ",".join(sdks_list)
sdks_url

'amazon-aws,firebase,parse,particle-sdk,wns-sdk'
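
A comma-joined string like this one can be placed in a URL query string with the standard library. This is only a sketch: the helper `build_query` and the parameter name `sdks` are hypothetical and are not taken from the API documentation.

```python
from urllib.parse import urlencode

def build_query(sdk_ids, param_name='sdks'):
    """Join the ids with commas and encode them as a single query parameter."""
    # safe=',' keeps the commas readable instead of encoding them as %2C.
    # param_name is a hypothetical parameter name, used only for illustration.
    return urlencode({param_name: ','.join(sdk_ids)}, safe=',')

print(build_query(['amazon-aws', 'firebase', 'parse']))
# sdks=amazon-aws,firebase,parse
```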

In [None]:
#Alternative selector: retrieve only the <p> element of the first sdk tile.
#Other options tried and discarded:
# - XPath by class: //div[@class='sdk-list__grid__tile__title']
# - tag name 'p'
# - a CSS selector on the '<!-- react-text: 671 -->' comment node (not a valid CSS selector)
# - XPath text node: //*[@id="sdk-doc-app"]/div/div/div/div[2]/div/div[1]/p/text()
from selenium.webdriver.common.by import By
sdks = browser.find_elements(By.XPATH, '//*[@id="sdk-doc-app"]/div/div/div/div[2]/div/div[1]/p')