# Data Extraction from PDF Using Python + Probability Theory Application
<div style="text-align: justify">
The purpose of this project is to work hands-on with a data-driven problem. The project isn't complex, but it is sufficient to demonstrate my original intention: applying some basic Statistics and Probability Theory to real-world data. Additionally, by working with real data I expect to encounter more challenges, which lets even a simple project showcase various aspects of my skills and problem-solving capabilities.

## Environment Management
Please read the project's README for instructions on how to set up the project's environment on your computer. 

In [None]:
# Ensure Jupyter is running in the correct environment to avoid conflicts:
import sys
sys.executable
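As a hedged extension of the check above, the sketch below gathers a few more hints about the active environment. The helper name `describe_environment` is hypothetical, and the interpretation assumes a standard `venv` (where `sys.prefix` differs from `sys.base_prefix`) or a conda setup that exports `CONDA_DEFAULT_ENV`; the README remains the authoritative setup guide.

```python
import os
import sys


def describe_environment():
    """Collect a few hints about which Python environment is active.

    Hypothetical helper, not part of the project.
    """
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # True inside a standard venv; conda environments usually report False here.
        "in_venv": sys.prefix != sys.base_prefix,
        # Set by conda when an environment is activated, otherwise None.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```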

## Table Extraction from PDF
<div style="text-align: justify">
The table of interest is table B1, located on page 3 of the PDF. The tabula library's results weren't as good as the camelot library's results, so I chose camelot to extract the tables on page 3. Camelot also requires the PyPDF2 library, which extracts text from PDFs. However, PyPDF2 can't extract tables separately from the text, so on its own it doesn't solve this kind of problem.

In [None]:
import pandas as pd  # For data manipulation and analysis.	

# Libraries for PDF data extraction:
import camelot
import PyPDF2

In [None]:
# Relative path to the pdf file:
pdf_path = "CDS_2017_2018_Hamilton.pdf"

# Extract only the page 3 tables from the PDF with 'lattice' option:
tables = camelot.read_pdf(pdf_path, pages="3-3", flavor="lattice")

In [None]:
# Check all elements of 'tables' variable:
for table in tables:
    print(table.df)

<div style="text-align: justify">
Printing every element of the 'tables' variable isn't very useful, as it doesn't directly show how many tables were extracted. However, we know that page 3 of the PDF includes <strong>3 tables</strong> and that the B1 table (the first table) is the table of interest. Let's investigate further!

In [None]:
# Check how many tables were extracted from page 3, hoping to see '3' as the answer:
print(len(tables))

Since the number of elements in the 'tables' list is correct, the last step is to check how these elements are separated.

In [None]:
# Check the first element of the variable 'tables':
print(tables[0].df)

Indeed, the first element of the 'tables' list includes all the data of the B1 table!

In [None]:
# Check if the extraction was accurate:
print(tables[0].parsing_report)
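The parsing report is a plain dictionary with `accuracy`, `whitespace`, `order` and `page` entries. A minimal sketch of how it could be turned into a pass/fail check follows; the helper name, the thresholds and the sample values are all assumptions chosen for illustration, not values from this notebook's output.

```python
def extraction_looks_reliable(report, min_accuracy=95.0, max_whitespace=50.0):
    """Return True when a camelot-style parsing report passes both thresholds.

    Hypothetical helper: the default thresholds are illustrative only.
    """
    return (
        report["accuracy"] >= min_accuracy
        and report["whitespace"] <= max_whitespace
    )


# Made-up report with the same keys camelot uses; not the real output for B1.
sample_report = {"accuracy": 99.0, "whitespace": 12.0, "order": 1, "page": 3}

print(extraction_looks_reliable(sample_report))
```

In the notebook this would be called as `extraction_looks_reliable(tables[0].parsing_report)`. A high whitespace score is not necessarily a problem for a table that legitimately contains many empty cells, so the second threshold should be read loosely.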

In [None]:
# Assign the B1 pdf's table to a variable:
b1_table = tables[0].df

## Extracted Data Cleaning

In [None]:
# This is already a df:
type(b1_table)

In [None]:
print(b1_table.shape)
b1_table

In [None]:
# Set the correct column names:
b1_table.columns = ["studies_info", "men_fulltime", "women_fulltime", "men_parttime", "women_parttime"]

# Drop observations 0 and 1 (first two rows):
b1_table = b1_table.drop([0, 1], axis=0)
b1_table

<div style="text-align: justify">
Good progress has been made: the information of the original B1 table is now shown clearly.  
The next step is to interpret the missing values of the table. The table presents information for two basic categories of students, undergraduates (bachelor's level) and graduates. The empty values in the second category probably mean that Hamilton College didn't offer graduate or PhD programs, at least in 2017-2018, the period this data covers.  
For this reason, it is reasonable to drop all these rows.

In [None]:
# Drop rows 9-13:
b1_table = b1_table.drop([9, 10, 11, 12, 13], axis=0)
b1_table
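Hard-coded row labels such as 9-13 depend on this particular extraction. As a hedged alternative, the sketch below uses a made-up toy table (its labels and numbers are illustrative, not the real B1 data) to show how rows whose count cells are all empty could be located programmatically.

```python
import pandas as pd

# Toy stand-in for the extracted table; every value here is made up.
toy = pd.DataFrame(
    {
        "studies_info": ["Freshmen", "Other", "Graduate", "Total graduate"],
        "men_fulltime": ["10", "5", "", ""],
        "women_fulltime": ["12", "", "", ""],
    }
)

count_cols = ["men_fulltime", "women_fulltime"]

# A row counts as empty only when every count column holds an empty string.
is_empty_row = toy[count_cols].eq("").all(axis=1)

kept = toy.loc[~is_empty_row]
print(kept)
```

Such a mask would also catch section-header rows whose count cells are blank, so it is worth inspecting `is_empty_row` before dropping anything.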

<div style="text-align: justify">
The information in row 2 should be moved into the column names, and the row then dropped, for better data interpretability. The abbreviation "ug" is used instead of the whole word "undergraduate".

In [None]:
# Set the final column names:
b1_table.columns = ["studies_info", "ug_men_fulltime", "ug_women_fulltime", "ug_men_parttime", "ug_women_parttime"]

# Drop row 2, whose information now lives in the column names:
b1_table = b1_table.drop(2, axis=0)
b1_table

The remaining empty cells don't need further investigation; they should be replaced by zeros, because that is what they represent.

In [None]:
# Replace the empty values with 0:
b1_table = b1_table.replace('', 0)
b1_table

In [None]:
# Reset the index so that it starts from 0 again:
b1_table = b1_table.reset_index(drop=True)
b1_table

In [None]:
# Remove commas (e.g. thousands separators) wherever they appear:
b1_table = b1_table.replace(",", "", regex=True)
b1_table

In [None]:
# Remove line breaks wherever they appear:
b1_table = b1_table.replace("\n", "", regex=True)
b1_table

<div style="text-align: justify">
This is the final result, and the table data seem to be tidy. The very last step is to fix the dtypes of the table's columns. Of course, this isn't a typical DataFrame where the rows represent observations; here the rows represent categories of undergraduate enrollment information. However, this doesn't affect the original intentions of this project in any way.

In [None]:
# The table contains wrong column dtypes:
b1_table.info()

In [None]:
# Correct the dtypes of the count columns:
count_cols = [
    "ug_men_fulltime",
    "ug_women_fulltime",
    "ug_men_parttime",
    "ug_women_parttime",
]
b1_table[count_cols] = b1_table[count_cols].astype(int)
b1_table.info()
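The cleaning steps above act on the whole table, including the text column. A hedged variant, shown on made-up toy data, restricts the string cleaning and the integer conversion to the count columns only; the toy values are illustrative and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Toy data imitating raw extracted cells; the values are made up.
toy = pd.DataFrame(
    {
        "studies_info": ["Degree-seeking,\nfirst-time", "Total"],
        "ug_men_fulltime": ["1,234", "2\n"],
        "ug_women_fulltime": ["", "567"],
    }
)
count_cols = ["ug_men_fulltime", "ug_women_fulltime"]

cleaned = toy.copy()
for col in count_cols:
    cleaned[col] = (
        cleaned[col]
        .str.replace(",", "", regex=False)   # drop thousands separators
        .str.replace("\n", "", regex=False)  # drop stray line breaks
        .replace("", "0")                    # empty cell means zero students
        .astype(int)
    )

print(cleaned[count_cols])
```

With this design the text in `studies_info` keeps its original punctuation, which may or may not be desirable depending on how the labels are used later.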

## Basic Statistics and Probability Theory Application
This section applies basic Statistics and Probability Theory to the cleaned data, fulfilling the original intention of this project. The main goal is to demonstrate some basic knowledge of probability theory using Jupyter Notebook and Markdown syntax.

__<br>
What is the sample space of Hamilton College students? What are the favorable outcomes of being a male and being a female?__

The sample space is the set of all Hamilton College students, and its size is written here as $Ω = 1897$. Of these outcomes, $N_{M} = 888$ are favorable to the event of being a male and $N_{F} = 1009$ are favorable to the event of being a female.

__What are the probabilities of choosing a male and a female student from the sample space?__

From the classical definition of probability, the probability of choosing a male student from the sample space is $P_{MALE}=\frac{N_{M}}{Ω}=0.4681$, whereas the corresponding probability of choosing a female student is $P_{FEMALE}=\frac{N_{F}}{Ω}=0.5319$.
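The two classical probabilities can be reproduced with a few lines of Python. This is a minimal sketch using only the counts quoted in the text (888 and 1009); the variable names are illustrative.

```python
from fractions import Fraction

# Counts quoted in the text above.
n_male = 888
n_female = 1009
n_total = n_male + n_female  # size of the sample space

# Classical definition: favorable outcomes divided by all outcomes.
p_male = Fraction(n_male, n_total)
p_female = Fraction(n_female, n_total)

print(f"P(Male)   = {float(p_male):.4f}")
print(f"P(Female) = {float(p_female):.4f}")
```

Using `Fraction` keeps the values exact, so the two probabilities sum to exactly 1 before any rounding is applied.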

__What is the likelihood of being both a female and a freshman?__

This is the probability of the intersection, $P(Female∩Freshman)$, which equals $263/1897$, so $P(Female∩Freshman)=0.1386$.

__What is the probability of someone being a freshman?__

The probability of someone being a freshman is $P_{FRESHMAN}=0.253$.

__What is the likelihood of someone being a female, given that the same person is a freshman?__

This is the conditional probability:  
$$P(Female∣Freshman)=\frac{P(Female∩Freshman)}{P_{FRESHMAN}}=\frac{\frac{263}{1897}}{\frac{480}{1897}}=\frac{263}{480}\approx 0.5479$$
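As a hedged cross-check of the conditional probability, the sketch below works with exact fractions built from the counts this notebook uses (263 female and 217 male freshmen out of 1897 students); the variable names are illustrative.

```python
from fractions import Fraction

n_total = 1897
n_female_freshman = 263
n_male_freshman = 217
n_freshman = n_female_freshman + n_male_freshman  # 480

p_joint = Fraction(n_female_freshman, n_total)  # P(Female ∩ Freshman)
p_freshman = Fraction(n_freshman, n_total)      # P(Freshman)

# Definition of conditional probability: P(A | B) = P(A ∩ B) / P(B).
p_female_given_freshman = p_joint / p_freshman

print(p_female_given_freshman)
print(f"{float(p_female_given_freshman):.4f}")
```

The sample-space size cancels, leaving 263/480, which rounds to 0.5479. Dividing the already rounded values 0.1386 and 0.253 gives 0.5478 instead, so the last digit depends on when the rounding happens.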

**How was $P_{FRESHMAN}=0.253$ calculated?**

<div style="text-align: justify">
This is a union of events, so the addition rule applies: 
$$P_{FRESHMAN}=P(Female∩Freshman)+P(Male∩Freshman)-P((Female∩Freshman)∩(Male∩Freshman))$$  
Since $(Female∩Freshman)∩(Male∩Freshman)=Ø$, because the two events are mutually exclusive (a person can't be counted as both a male and a female), the last term equals $0$ and:  
$$P_{FRESHMAN}=\frac{263}{1897}+\frac{217}{1897}=0.253$$
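The addition rule for mutually exclusive events can be illustrated with Python sets. In this sketch the integer IDs are hypothetical placeholders, used only to build two disjoint groups with the sizes quoted above (263 and 217).

```python
from fractions import Fraction

n_total = 1897

# Hypothetical student IDs: two non-overlapping ranges of the right sizes.
female_freshmen = set(range(0, 263))
male_freshmen = set(range(263, 263 + 217))

# Mutually exclusive events share no outcomes.
print(female_freshmen.isdisjoint(male_freshmen))

freshmen = female_freshmen | male_freshmen  # union of the two events

p_union = Fraction(len(freshmen), n_total)
p_sum = Fraction(len(female_freshmen), n_total) + Fraction(len(male_freshmen), n_total)

print(p_union == p_sum)
print(f"{float(p_union):.3f}")
```

Because the intersection is empty, the probability of the union equals the plain sum of the two probabilities, with nothing to subtract.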

__<div style="text-align: justify">
What is the probability of being a male, given that the same person is not a freshman? How does it compare with the probability of not being a freshman, given that the person is a male?__

<div style="text-align: justify">
The number of male students who are not freshmen is $N_{MALENOTFRESHMAN}=671$. The probability can be calculated quickly by dividing this number by the size of the reduced sample space, $Ω-N_{FRESHMAN}=1417$, but let's find the same probability by applying some probability theory, as before. The probability to calculate is the conditional probability $P(Male∣NotFreshman)$:  
$$P(Male∣NotFreshman)=\frac{P(Male∩NotFreshman)}{P_{NOTFRESHMAN}}=\frac{\frac{671}{1897}}{\frac{1417}{1897}}=0.4735$$

<div style="text-align: justify">
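To compare $P(Male∣NotFreshman)$ with $P(NotFreshman∣Male)$, Bayes' theorem will be used:  
$$P(NotFreshman∣Male)=\frac{P(Male∣NotFreshman)P_{NOTFRESHMAN}}{P_{MALE}}=\frac{\frac{671}{1417}\cdot\frac{1417}{1897}}{\frac{888}{1897}}=\frac{671}{888}\approx 0.7556$$

So about 47% of the non-freshmen are male, whereas about 76% of the male students are non-freshmen: the two conditional probabilities answer different questions and are not interchangeable.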
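The Bayes' theorem step can be cross-checked with exact fractions. This is a minimal sketch built from the counts used in this section (1897 students, 888 males, 480 freshmen, 217 male freshmen); the variable names are illustrative.

```python
from fractions import Fraction

n_total = 1897
n_male = 888
n_freshman = 480
n_male_freshman = 217

n_male_not_freshman = n_male - n_male_freshman  # 671
n_not_freshman = n_total - n_freshman           # 1417

p_male = Fraction(n_male, n_total)
p_not_freshman = Fraction(n_not_freshman, n_total)
p_male_given_not_freshman = Fraction(n_male_not_freshman, n_not_freshman)

# Bayes' theorem: P(B | A) = P(A | B) * P(B) / P(A).
p_not_freshman_given_male = p_male_given_not_freshman * p_not_freshman / p_male

print(f"P(Male | NotFreshman) = {float(p_male_given_not_freshman):.4f}")
print(f"P(NotFreshman | Male) = {float(p_not_freshman_given_male):.4f}")
```

The exact result is 671/888, which rounds to 0.7556. Plugging in the rounded intermediate values 0.4735, 0.7469 and 0.4681 yields 0.7555, so a small last-digit difference comes purely from rounding.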