# Scraping Data
**Dataset Source:** [Disease-Symptom Knowledge Database](https://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html)<br>

***About the dataset***<br>
<p style="text-align: justify">
This table below is a knowledge database of disease-symptom associations generated by an automated method based on information in textual discharge summaries of patients at New York Presbyterian Hospital admitted during 2004.  The first column shows the disease, the second the number of discharge summaries containing a positive and current mention of the disease, and the associated symptom. Associations for the 150 most frequent diseases based on these notes were computed and the symptoms are shown ranked based on the strength of association.  The method used the MedLEE natural language processing system to obtain UMLS codes for diseases and symptoms from the notes; then statistical methods based on frequencies and co-occurrences were used to obtain the associations. A more detailed description of the automated method can be found in Wang X, Chused A, Elhadad N, Friedman C, Markatou M. Automated knowledge acquisition from clinical reports. AMIA Annu Symp Proc. 2008. p. 783-7. PMCID: PMC2656103.
</p>
<div style="text-align: right"><i>Taken from <a href="https://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html">Columbia University</a></i></div>

Import the necessary libraries.

In [29]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Define the URL path.

In [40]:
URL = 'https://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html'

Fetch the page and parse its HTML.

In [41]:
soup = BeautifulSoup(requests.get(URL).text, 'lxml')
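
A slightly more defensive variant of the fetch step could look like the sketch below. It assumes a hypothetical helper named `fetch_html` (not part of the original notebook) that sets a timeout and raises on HTTP error codes, so a failed download is not silently parsed as if it were the dataset page. The 30-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, session=None, timeout=30.0):
    """Return the body of `url` as text, raising on HTTP errors.

    `session` is optional; any object with a requests-style `.get()`
    method works, which also makes the helper easy to test offline.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


# Usage with the notebook's own names (performs a real network request):
# soup = BeautifulSoup(fetch_html(URL), 'lxml')
```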

Find the first table on the page.

In [42]:
table = soup.find('table')
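
`soup.find` returns `None` when nothing matches, so a changed page layout would only surface later as a confusing error. The sketch below wraps the lookup in a hypothetical helper, `first_table`, that fails early instead. It uses the built-in `html.parser` backend only so the example has no dependency on `lxml`.

```python
from bs4 import BeautifulSoup


def first_table(html):
    """Return the first <table> element in `html`, or raise if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found in the page")
    return table


# Hypothetical miniature page, used only to illustrate the helper.
demo_html = "<p>intro</p><table><tr><td>Disease</td></tr></table>"
print(first_table(demo_html).find("td").get_text())
```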

Convert the table into a data frame.

In [43]:
from io import StringIO  # recent pandas versions deprecate passing literal HTML strings
df = pd.read_html(StringIO(str(table)))[0]
df

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Disease | Count of Disease Occurrence | Symptom |
| 1 | UMLS:C0020538_hypertensive disease | 3363 | UMLS:C0008031_pain chest |
| 2 | | | UMLS:C0392680_shortness of breath |
| 3 | | | UMLS:C0012833_dizziness |
| 4 | | | UMLS:C0004093_asthenia |
| ... | ... | ... | ... |
| 1862 | | | UMLS:C0425251_bedridden^UMLS:C0741453_bedridden |
| 1863 | | | UMLS:C0242453_prostatism |
| 1864 | UMLS:C0011127_decubitus ulcer | 42 | UMLS:C0232257_systolic murmur |
| 1865 | | | UMLS:C0871754_frail |
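
The raw frame keeps the column names in row 0 and leaves the disease and count cells blank on every row after the first symptom of a disease. One possible cleanup, shown here as a sketch and not as a step of the original notebook, promotes row 0 to the header, forward-fills the blank cells, and splits the `UMLS:<code>_<name>` tokens. The helper names `tidy_raw_table` and `split_umls` are hypothetical. The sketch assumes that blank cells belong to the disease listed above them and that `^` separates multiple codes within one cell, as the output suggests.

```python
import pandas as pd


def tidy_raw_table(raw):
    """Promote row 0 to the header and fill the blank disease/count cells."""
    tidy = raw.iloc[1:].reset_index(drop=True)
    tidy.columns = raw.iloc[0].tolist()
    group_cols = ["Disease", "Count of Disease Occurrence"]
    # Assumption: a blank cell means "same disease as the row above".
    tidy[group_cols] = tidy[group_cols].ffill()
    return tidy


def split_umls(cell):
    """Split 'UMLS:CODE_name^UMLS:CODE_name' into (code, name) pairs."""
    pairs = []
    for token in cell.split("^"):
        code, _, name = token.partition("_")
        pairs.append((code.replace("UMLS:", "", 1), name))
    return pairs


# Small sample copied from the first rows of the scraped output.
sample = pd.DataFrame([
    ["Disease", "Count of Disease Occurrence", "Symptom"],
    ["UMLS:C0020538_hypertensive disease", "3363", "UMLS:C0008031_pain chest"],
    [None, None, "UMLS:C0392680_shortness of breath"],
    [None, None, "UMLS:C0012833_dizziness"],
])
clean = tidy_raw_table(sample)
print(clean)
print(split_umls("UMLS:C0425251_bedridden^UMLS:C0741453_bedridden"))
```

Keeping the raw export untouched and applying a cleanup like this in a later step makes it possible to revisit these assumptions without scraping the page again.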


Export the data frame as a CSV file.

In [45]:
CSV_FILEPATH = 'dataset/raw_data.csv'
# Row 0 of df already holds the column names, so skip pandas' own header and index.
df.to_csv(CSV_FILEPATH, index=False, header=False)
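
`to_csv` does not create missing directories, so the export fails if `dataset/` does not exist yet. A minimal sketch of a guard, using a hypothetical helper named `ensure_parent_dir`, is shown below. The demo writes into a temporary directory so it leaves no files behind.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(filepath):
    """Create the parent directory of `filepath` if needed; return it as a Path."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demo in a throwaway directory; in the notebook this could instead be
# ensure_parent_dir(CSV_FILEPATH), called just before df.to_csv(...).
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "dataset" / "raw_data.csv")
    created = target.parent.is_dir()
print(created)
```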