## Data Acquisition 

- [x] Before running this notebook, use [PetScan](https://petscan.wmflabs.org/) to export a CSV file (saved as _Scientist2levels.csv_) of the Wikipedia titles under **Category:Scientists**
- [x] In this notebook, open the CSV file and use the `title` column as input for extracting the actual articles with wikipedia-api
- [x] Save the results as a pandas DataFrame

In [None]:
import pandas as pd
import numpy as np

import pickle
import wikipediaapi

### Extract pages with Wikipedia-API

In [None]:
# open csv file
scientistList = pd.read_csv('../data/Scientist2levels.csv')
scientistList.head()

In [None]:
# Collect names of scientists
scientistNames = scientistList['title'].values.tolist()
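
Before requesting pages, it may help to tidy the title list. The sketch below assumes the export can contain underscores in place of spaces, duplicate titles, or missing values; none of that is confirmed by this notebook. `normalize_titles` is a hypothetical helper and the sample titles are made up for illustration.

```python
def normalize_titles(raw_titles):
    """Return cleaned, de-duplicated titles in their original order."""
    seen = set()
    cleaned = []
    for raw in raw_titles:
        # Skip missing values such as None or NaN read from the CSV.
        if not isinstance(raw, str):
            continue
        # Assumption: underscores stand in for spaces in exported titles.
        title = raw.replace('_', ' ').strip()
        if title and title not in seen:
            seen.add(title)
            cleaned.append(title)
    return cleaned


# Illustrative input only, not taken from the real CSV.
sample = ['Marie_Curie', 'Marie Curie', ' Alan_Turing ', None]
print(normalize_titles(sample))
```

In the notebook this could be applied as `scientistNames = normalize_titles(scientistNames)` before calling `getarticles`.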

---

In [None]:
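# Wrapper around wikipedia-api

def getarticles(titles):
    '''Return a list of wikipedia-api page objects, one per title.

    The pages are loaded lazily: their content is only requested
    when an attribute such as `summary` is first accessed.

    input:
        titles - list of article titles
    '''
    # Create the client once instead of once per title.
    # Note: recent wikipedia-api releases may also require a user_agent argument.
    wiki = wikipediaapi.Wikipedia(
            language='en',
            extract_format=wikipediaapi.ExtractFormat.WIKI
    )
    collection = [wiki.page(each) for each in titles]

    return collection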
 

In [None]:
# collect articles under the category of scientist
collection = getarticles(scientistNames)

In [None]:
print('total number of articles collected: ', len(collection))
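
The count above reflects the number of titles requested, not the number of articles that actually exist. A minimal sketch of separating the two follows, relying on the `exists()` method and `title` attribute of wikipedia-api page objects. `split_by_existence` and `FakePage` are hypothetical names; `FakePage` only stands in for a real page so the example runs without network access.

```python
def split_by_existence(pages):
    """Separate page objects into existing pages and missing titles."""
    found, missing = [], []
    for page in pages:
        # With real wikipedia-api pages, exists() may trigger a request,
        # so this loop can be slow for long lists.
        if page.exists():
            found.append(page)
        else:
            missing.append(page.title)
    return found, missing


class FakePage:
    """Hypothetical stand-in for a wikipedia-api page, for this demo only."""

    def __init__(self, title, present):
        self.title = title
        self._present = present

    def exists(self):
        return self._present


demo = [FakePage('Real article', True), FakePage('No such article', False)]
found, missing = split_by_existence(demo)
print(len(found), missing)
```

In the notebook, `split_by_existence(collection)` would give the pages worth unpacking and a list of titles to inspect by hand.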

**Unpack Results**
- Collect the titles and summaries from the wiki articles as lists.
- Then create a pandas DataFrame from them.

In [None]:
# Collect the title and summary of every page object in collection
# This step can take a while with a lot of data, since the summaries are requested here
titles = [each.title for each in collection]
summaries = [each.summary for each in collection]
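
Because this step touches the network for every page, a single failed request would abort the whole list comprehension. The sketch below is one possible way to make it tolerant: it collects `(title, summary)` pairs together so the two lists cannot drift out of step, and records failures instead of raising. `extract_fields` and `StubPage` are hypothetical names; the stub simulates a failing page so the example runs offline.

```python
def extract_fields(pages):
    """Return (rows, failed): rows holds (title, summary) pairs."""
    rows, failed = [], []
    for index, page in enumerate(pages):
        try:
            rows.append((page.title, page.summary))
        except Exception as err:  # broad on purpose: failure types vary
            failed.append((index, repr(err)))
    return rows, failed


class StubPage:
    """Hypothetical page stand-in; raises on .summary when fail=True."""

    def __init__(self, title, text, fail=False):
        self.title = title
        self._text = text
        self._fail = fail

    @property
    def summary(self):
        if self._fail:
            raise ConnectionError('simulated network failure')
        return self._text


rows, failed = extract_fields([StubPage('A', 'first'),
                               StubPage('B', '', fail=True)])
demo_titles = [title for title, _ in rows]
demo_summaries = [summary for _, summary in rows]
print(demo_titles, demo_summaries, len(failed))
```

The indices in `failed` point back into the input list, so those pages can be retried later.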

In [None]:
# Pickle the summaries, just in case
with open('../data/list_summaries.pkl','wb') as fout:
    pickle.dump(summaries, fout)
    
# Pickle the titles as well
with open('../data/list_titles.pkl','wb') as fout:
    pickle.dump(titles, fout)
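
The pickled lists are only useful if they can be read back, which avoids repeating the slow download. A minimal round-trip sketch with the standard `pickle` module follows. `save_list` and `load_list` are hypothetical helper names, and the demo writes to a temporary directory rather than `../data/`.

```python
import os
import pickle
import tempfile


def save_list(items, path):
    """Write a Python list to path with pickle."""
    with open(path, 'wb') as fout:
        pickle.dump(items, fout)


def load_list(path):
    """Read a pickled list back. Only load pickle files you created yourself."""
    with open(path, 'rb') as fin:
        return pickle.load(fin)


# Demo in a temporary directory so no project files are touched.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'list_titles.pkl')
    save_list(['first title', 'second title'], demo_path)
    restored = load_list(demo_path)

print(restored)
```

In this notebook, `titles = load_list('../data/list_titles.pkl')` would restore the titles saved above.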

In [None]:
# Put these lists into a dataframe, one column per list
df = pd.DataFrame({'title': titles, 'summary': summaries})

In [None]:
df.head()

In [None]:
# Pickle the dataframe for later processing
with open('../data/dfraw.pkl','wb') as fout:
    pickle.dump(df, fout)

**Next step**:
- Clean the dataframe and do EDA, in Step2_Cleaning.ipynb

---