# Extracting information from a web page
## Importing the required libraries
We import the libraries we need.

In [1]:
import urllib3
import webbrowser
from bs4 import BeautifulSoup
import pandas as pd

## Retrieving the information
First, let's open the page we want to download in a new browser tab.

In [None]:
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
webbrowser.open(wiki)

Now we get the server's response (`response`) and load the HTML with a library that can parse it (`BeautifulSoup`).

In [None]:
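# Suppress urllib3's InsecureRequestWarning messages
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Fetch the page through a pooled HTTP client
http = urllib3.PoolManager()
response = http.request('GET', wiki)
# Decode the response body and parse it with the lxml parser
soup = BeautifulSoup(response.data.decode('utf-8'), "lxml")
print(soup.prettify())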

## Working with HTML tags
We get the `<title>` tag.

In [4]:
soup.title

<title>List of state and union territory capitals in India - Wikipedia</title>

We get the string contained in the `<title>` tag.

In [5]:
soup.title.string

'List of state and union territory capitals in India - Wikipedia'

We get the first `<a>` tag.

In [8]:
soup.a

<a id="top"></a>

We get all the `<a>` tags.

In [None]:
all_links = soup.find_all("a")
all_links

Let's iterate over the links and print the `href` attribute of each one.

In [None]:
for link in all_links:
    print(link.get("href"))
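Many `href` values on a page like this are relative paths or in-page anchors, and `link.get("href")` returns `None` for an `<a>` tag that has no such attribute (for example `<a id="top"></a>`). A minimal sketch of how those values could be turned into absolute URLs with the standard library's `urljoin`. The helper name `absolute_links` and the `hrefs` sample are assumptions made for illustration, not real output from the page:

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"


def absolute_links(hrefs, base=BASE):
    """Return absolute URLs, skipping missing hrefs and in-page anchors."""
    result = []
    for href in hrefs:
        if href is None or href.startswith("#"):
            continue  # no href attribute, or a jump inside the same page
        result.append(urljoin(base, href))
    return result


# Hypothetical sample of values that link.get("href") might return
hrefs = [
    None,
    "#mw-head",
    "/wiki/India",
    "//commons.wikimedia.org/wiki/Main_Page",
    "https://www.mediawiki.org/",
]

links = absolute_links(hrefs)
for url in links:
    print(url)
```

In the notebook, the same helper could be fed with `[link.get("href") for link in all_links]`.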

Let's look for one specific table, identified by its CSS classes.

In [None]:
right_table = soup.find('table', class_='wikitable sortable plainrowheaders')
right_table
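When `class_` is given a string with several class names, Beautiful Soup compares it against the exact value of the `class` attribute, so the lookup returns `None` if the page lists the same classes in another order or adds one. A small sketch of a more tolerant lookup using a CSS selector. The HTML snippet is a hypothetical miniature page, and it is parsed with the built-in `html.parser` so the example does not need `lxml`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same three classes in a different order, plus an extra one
html = """
<table class="sortable wikitable plainrowheaders jquery-tablesorter">
  <tr><th>State</th><td>Capital</td></tr>
</table>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# Exact match on the full attribute value: fails for this markup
exact = sample_soup.find("table", class_="wikitable sortable plainrowheaders")

# CSS selector: matches any table that carries all three classes, in any order
by_selector = sample_soup.select_one("table.wikitable.sortable.plainrowheaders")

print(exact)
print(by_selector is not None)
```

Whether the selector form is needed depends on how stable the page markup is. The exact-string form is fine as long as the attribute value does not change.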

Now comes a bit of data manipulation: we walk through the table rows and collect the text of each cell into one list per column.

In [None]:
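# One list per column of the final table
A, B, C, D, E, F, G = [[] for _ in range(7)]

for row in right_table.find_all("tr"):
    cells = row.find_all('td')    # data cells of the row
    states = row.find_all('th')   # header cell holding the state/UT name
    if len(cells) == 6:           # skip header rows and rows with another layout
        B.append(states[0].find(string=True))
        # Store the first text node of each data cell in its column list
        for column, cell in zip([A, C, D, E, F, G], cells):
            column.append(cell.find(string=True))

for column in [A, B, C, D, E, F, G]:
    print(column)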
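The column-building step can also be sketched without any HTML, which makes the logic easier to follow: keep only the rows with six data cells, put the row header in second position, and transpose the rows into columns with `zip`. The `rows` list below is a hypothetical stand-in for the text already extracted from each `<tr>`:

```python
# Hypothetical rows: (texts of the <th> cells, texts of the <td> cells)
rows = [
    (["No.", "State/UT", "Administrative capital"], []),  # header row, no <td>
    (["Andhra Pradesh"], ["2", "Hyderabad", "Amaravati", "Hyderabad", "1956", "Kurnool"]),
    (["Assam"], ["4", "Dispur", "Guwahati", "Guwahati", "1975", "Shillong"]),
]

# Keep rows with exactly six data cells; insert the row header after the number
records = [
    [cells[0], headers[0], *cells[1:]]
    for headers, cells in rows
    if len(cells) == 6
]

# Transpose the list of rows into a list of columns
columns = [list(column) for column in zip(*records)]
A, B, C, D, E, F, G = columns

for column in columns:
    print(column)
```

Note that the final unpacking assumes at least one row passed the filter; with no matching rows, `columns` would be empty.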

Ugly, isn't it? Now let's put the information into a pandas `DataFrame`.

In [21]:
df = pd.DataFrame(A, columns=['Number'])
df['State/UT'] = B
df['Admin_Capital'] = C
df['Legislative_Capital'] = D
df['Judiciary_Capital'] = E
df['Year_Capital'] = F
df['Former_Capital'] = G
df

| | Number | State/UT | Admin_Capital | Legislative_Capital | Judiciary_Capital | Year_Capital | Former_Capital |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Andaman and Nicobar Islands | Port Blair | Port Blair | Kolkata | 1955 | Calcutta (1945–1955) |
| 1 | 2 | Andhra Pradesh | Hyderabad | Amaravati | Hyderabad | 1956 | Kurnool |
| 2 | 3 | Arunachal Pradesh | Itanagar | Itanagar | Guwahati | 1986 | |
| 3 | 4 | Assam | Dispur | Guwahati | Guwahati | 1975 | Shillong |
| 4 | 5 | Bihar | Patna | Patna | Patna | 1912 | |
| 5 | 6 | Chandigarh | Chandigarh | — | Chandigarh | 1966 | — |
| 6 | 7 | Chhattisgarh | Naya Raipur | Raipur | Bilaspur | 2000 | — |
| 7 | 8 | Dadra and Nagar Haveli | Silvassa | — | Mumbai | 1945 | Mumbai (1954–1961) |
| 8 | 9 | Daman and Diu | Daman | — | Mumbai | 1987 | Ahmedabad |
| 9 | 10 | National Capital Territory of Delhi | New Delhi | New Delhi | New Delhi | 1931 | — |
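As an alternative to adding the columns one at a time, a `DataFrame` can be built in a single call from a dictionary that maps column names to lists. A short sketch under that assumption; the two-row lists are hypothetical sample data standing in for the scraped `A` to `G` lists:

```python
import pandas as pd

# Hypothetical two-row sample in place of the scraped lists
A = ["2", "4"]
B = ["Andhra Pradesh", "Assam"]
C = ["Hyderabad", "Dispur"]
D = ["Amaravati", "Guwahati"]
E = ["Hyderabad", "Guwahati"]
F = ["1956", "1975"]
G = ["Kurnool", "Shillong"]

# Dictionary keys become column names, in insertion order
df_sample = pd.DataFrame({
    "Number": A,
    "State/UT": B,
    "Admin_Capital": C,
    "Legislative_Capital": D,
    "Judiciary_Capital": E,
    "Year_Capital": F,
    "Former_Capital": G,
})

print(df_sample)
```

All lists must have the same length, otherwise the constructor raises a `ValueError`.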


But all the values are strings, so we convert the year column to numbers before filtering on it.

In [23]:
df[['Year_Capital']] = df[['Year_Capital']].apply(pd.to_numeric, errors='raise')
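With `errors='raise'`, the conversion stops with a `ValueError` at the first value that cannot be parsed. Scraped cells sometimes carry footnote markers or other extra text, so a possible alternative is `errors='coerce'`, which turns unparseable values into `NaN` so they can be inspected afterwards. A minimal sketch on a hypothetical series; the value `"1986[a]"` is invented to illustrate the failure case:

```python
import pandas as pd

# Hypothetical year column with one value that is not a clean number
years = pd.Series(["1955", "1956", "1986[a]"])

# errors='raise' fails on the first bad value
try:
    pd.to_numeric(years, errors="raise")
    raised = False
except ValueError:
    raised = True

# errors='coerce' converts what it can and marks the rest as NaN
converted = pd.to_numeric(years, errors="coerce")

# The original strings that could not be converted
bad_values = years[converted.isna()].tolist()

print(raised)
print(converted)
print(bad_values)
```

Which mode fits better depends on the goal: `raise` makes data problems impossible to miss, while `coerce` keeps the pipeline running and leaves the cleanup for later.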

Now we can extract a subset of the data and then save it to a file.

In [24]:
result = df.query('Year_Capital > 1930 and Year_Capital < 1960')
result

| | Number | State/UT | Admin_Capital | Legislative_Capital | Judiciary_Capital | Year_Capital | Former_Capital |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Andaman and Nicobar Islands | Port Blair | Port Blair | Kolkata | 1955 | Calcutta (1945–1955) |
| 1 | 2 | Andhra Pradesh | Hyderabad | Amaravati | Hyderabad | 1956 | Kurnool |
| 7 | 8 | Dadra and Nagar Haveli | Silvassa | — | Mumbai | 1945 | Mumbai (1954–1961) |
| 9 | 10 | National Capital Territory of Delhi | New Delhi | New Delhi | New Delhi | 1931 | — |
| 14 | 15 | Jammu and Kashmir | Srinagar | Srinagar (Summer) | Srinagar (Summer) | 1947 | — |
| 16 | 17 | Karnataka | Bengaluru | Bengaluru | Bengaluru | 1940 | (Mysore) |
| 17 | 18 | Kerala | Thiruvananthapuram | Thiruvananthapuram | Kochi | 1956 | |
| 18 | 19 | Lakshadweep | Kavaratti | Kavaratti | Kochi | 1956 | |
| 19 | 20 | Madhya Pradesh | Bhopal | Bhopal | Jabalpur | 1956 | Nagpur |
| 21 | 22 | Manipur | Imphal | Imphal | Imphal | 1947 | — |


In [25]:
result.to_csv('data.csv', sep=';', encoding='utf-8')
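By default `to_csv` also writes the row index as a first column with an empty header. When such a file is read back without further options, pandas names that column `Unnamed: 0`. The sketch below shows the round trip with the same `;` separator, using an in-memory buffer instead of `data.csv` and a hypothetical two-row frame:

```python
import io

import pandas as pd

# Hypothetical stand-in for the filtered result
df_sample = pd.DataFrame({
    "State/UT": ["Andhra Pradesh", "Assam"],
    "Year_Capital": [1956, 1975],
})

# Write to an in-memory text buffer rather than a file on disk
buffer = io.StringIO()
df_sample.to_csv(buffer, sep=";")

# Reading back naively turns the saved index into an extra column
buffer.seek(0)
naive = pd.read_csv(buffer, sep=";")

# index_col=0 restores the first column as the index instead
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";", index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

If the index carries no information, passing `index=False` to `to_csv` avoids writing it in the first place.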