## Step 1 -- Convert the transcript samples from PDF to JPG format

### Install two required packages: pdf2image and poppler

#### Install pdf2image

In [1]:
pip install pdf2image

Note: you may need to restart the kernel to use updated packages.


#### Install poppler

Poppler can be downloaded from https://anaconda.org/conda-forge/poppler/files or installed with conda as follows.

To install `poppler` from the `conda-forge` channel, first add `conda-forge` to your channels:

```
conda config --add channels conda-forge
```

Once the `conda-forge` channel has been enabled, `poppler` can be installed with:

```
conda install poppler
```
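Since `pdf2image` calls poppler's command-line tools (such as `pdftoppm`) under the hood, it can help to confirm they are reachable before converting anything. The following is a minimal sketch, assuming poppler is expected on the `PATH`; `poppler_available` is a hypothetical helper name, not part of `pdf2image`.

```python
import shutil


def poppler_available() -> bool:
    """Return True if poppler's pdftoppm command-line tool is on the PATH."""
    return shutil.which("pdftoppm") is not None


if poppler_available():
    print("poppler found:", shutil.which("pdftoppm"))
else:
    # convert_from_path() also accepts a poppler_path argument pointing
    # at the folder that contains the poppler binaries.
    print("poppler not found on PATH; pass poppler_path=... to convert_from_path()")
```

If the check fails after a fresh install, restarting the kernel or the terminal so that the updated `PATH` is picked up is usually worth trying first.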

### Convert two transcript samples from PDF to JPG images

<b>Approach:</b>

- Import the <b>pdf2image</b> module
- Load the PDF pages with <b>convert_from_path()</b>
- Save each page image with <b>save()</b>

In [2]:
# Import the module
from pdf2image import convert_from_path

# Load every page of the PDF as a PIL image
file_path = 'sample_1.pdf'
images = convert_from_path(file_path)

# Save each page as a JPEG in the images folder (the folder must already exist)
for i, image in enumerate(images):
    image.save('./images/sample_' + str(i + 1) + '.jpg', 'JPEG')

In [3]:
# Load every page of the second PDF as a PIL image
file_path = 'sample_2.pdf'
images = convert_from_path(file_path)

# Numbering starts at 2 so the first page is saved as sample_2.jpg
for i, image in enumerate(images):
    image.save('./images/sample_' + str(i + 2) + '.jpg', 'JPEG')
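The two cells above repeat the same steps and depend on hand-picked page offsets to keep the file names distinct, which would collide if the first PDF had more than one page. As a minimal sketch, the logic could be wrapped in a helper. `page_image_path` and `pdf_to_jpgs` are hypothetical names, and the `<stem>_page_<n>.jpg` naming scheme is an assumption that differs from the `sample_1.jpg` and `sample_2.jpg` names used in Step 2.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir="images"):
    """Build an output path such as images/sample_1_page_1.jpg."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page_{page_number}.jpg"


def pdf_to_jpgs(pdf_path, out_dir="images", dpi=200):
    """Convert every page of pdf_path to a JPEG and return the saved paths."""
    # Imported here so the path helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    pages = convert_from_path(pdf_path, dpi=dpi)
    for page_number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, page_number, out_dir)
        page.save(target, "JPEG")
        saved.append(target)
    return saved
```

With this sketch, `pdf_to_jpgs("sample_1.pdf")` would write `images/sample_1_page_1.jpg`, so the image paths in Step 2 would need to be adjusted accordingly.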

## Step 2 -- Use Tesseract, OpenCV, and pytesseract to convert images to text strings

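### Install tesseract & pytesseract

Note that `pytesseract` is a Python wrapper and does not bundle the Tesseract OCR engine; the engine must be installed separately (the code below expects it at `C:\Program Files\Tesseract-OCR\tesseract.exe`). The `cv2` import used below additionally requires OpenCV, for example through the `opencv-python` package.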

In [4]:
pip install tesseract

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install pytesseract

Note: you may need to restart the kernel to use updated packages.
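Because `pytesseract` only wraps the Tesseract executable, a quick check that the executable can be found may save some debugging later. The sketch below is an assumption-laden helper (`find_tesseract` and `WINDOWS_DEFAULT` are hypothetical names); it looks on the `PATH` first and then falls back to the Windows install location used later in this notebook.

```python
import os
import shutil

# Install location used later in this notebook (Windows installer default).
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(candidates=(WINDOWS_DEFAULT,)):
    """Return the path of a Tesseract executable, or None if none is found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


print("Tesseract executable:", find_tesseract())
```

If this prints `None`, the OCR calls in the following cells will fail until the engine is installed or `pytesseract.pytesseract.tesseract_cmd` points at the correct location.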


### Transcript Sample #1

![Sample%20Transcript%201.jpg](attachment:Sample%20Transcript%201.jpg)

In [6]:
import cv2
import numpy as np
import pytesseract

# Tell pytesseract where the Tesseract engine is installed (Windows)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

image_path = './images/sample_1.jpg'

# Use OpenCV to read the image file; Pillow (PIL.Image) would also work
img = cv2.imread(image_path)

text = pytesseract.image_to_string(img)

# Append the recognized text to output1.txt and close the file afterwards
with open("output1.txt", "a") as f:
    print(text, file=f)

### Output:

Student Name/Address/Phone Student IO BOK
Official Transcript
POOOOQQQOM Lcenenenenimenl
POCOOCOO) POCO OOOO
POOQOOCOOOO POOOOOOOOON
Counselor Term Ending] Class of Grade
07/28/2016] 2017 12
Exit Reason Grad Date
Issued To Print Date GPA Type GPA Crat Atmpt
Overall Weighted 3.16 190.00
09/15/2016
1 of 1
Total Credits Earned 190.00
Course 10 Course fark Credits || Course (D_Course irk Credits | Graduation Requirements Short Req cmp]
DOOOOOOOOY Grd 0) HSSem | 01/14 | eereren Grd) HSSem1 01716] Comprehensive Grad Requirement
712223 © PE Course 1 A 5.00 |*+737450 AP Biology (HP) Cc 5.00] English 10.00 40.00 30,00
*737500 Biology (P) B- 5.00 |*+748380 AP Psychology (HP) 8. 5.00 Global Studies 5.00 $,00
728021 Health Education B 5.00] "748180 US Hist (P) A 5.00] World History 10.00 10.00
653635 Freshman English (P) c+ 5.00 ]*653795 Junior English (P) fom 5.00] US History 10.00 40,00
*865100 Spanish | (P) B 5.00] "685140 Spanish Ili (P} B 5.00] Civies 5.00 5.00 9,00
697110 Hon Geometry (P) c 5.00 |*+697230 AP Calculus AB (HP) D+ 5.00] Economics 5.00 5.00 0.00
TERM: GPA 2.83. Credits 30.00 | TERM: GPA 2.83 Credits 30.00] Math 20.00 20.00
CUMULATIVE; GPA 2.83 Credits 30.00] CUMULATIVE: GPA 3.08 Credits 160.00 | Physical Science 10.00 10.00
Life Science 10.00 10.00
DOOOOOOOOF Grd 09 HSSen2 06/14 | TeTerere Grd (1 HSSem2 06/16] VPA/World Language 10.00 10.00
712223 PE Course 1 A 5.00] *+737450 AP Biology (HP) c 5.00] Physical Education 10.00 20.00 10.00
"737500 _Biclogy (P) B+ 5.00 |*+748380 AP Psychology (HP) B 5.00| Health Education 5.00 5,00
“748330 Global Studies (P) B 5.00 | *748180 US Hist (P) A 5.00] Electives 10,00 80.00 70,00
"653635 Freshman English (P) B+ 6.00 | "653795 dunior English (P) F 0.00} Total Grad Credits 40.00 230.00 190.00
“865100 Spanish | (P) B+ 5.00 | "665140 Spanish fil (P) B 5.00] All Courses. 190.00 190.00
"697110 Hon Geometry (P} C+ 5.00 |*+697230 AP Calculus AB (HP) B 5.00] Speech Prof Passed
TERM: GPA 3.00 Credits 30.00 | TERM: GPA 3.60 Credits 26.00
CUMULATIVE: GPA 2.92 Credits 60.00] CUMULATIVE: GPA 3.46 Credits 185.00 | RE ———|
T
Out af District/Transfer Credits Grd (0 Summer Se 08/14 | AS SSTeTeTS Grd (1 HS Summer 07/16] EE
"TRUCO12 Sum Alg2 (P) / Transfer Bt 10.00 | "653795SS Junior English (P) B+ 5.00
TERM: GPA 3.00 Credits 10.00 | TERM: GPA 3.00 Credits 5.00
CUMULATIVE: GPA 2.93 Credits 70.00) CUMULATIVE: GPA 3.16 Credits 190.00
DOOOOOOOOF Grd 0 HS Sen} O1/1S [Work In Progress Entry Exit
°737660 Chemistry (P) A. 5.00 Grd (2
*748450 Wrid Hist 2-3 (P) A 5.00] 748220 Civics (P) O8NSNS 12/2216
"653655 Sophomore English (P} A 5.00] 624250 Photography 1 (P) 08/15/16 06/02/17
"665120 Spanish Il (P) Ae 5.00] 633210 Lit & Soc Just (P) Ossi 06/02/17
*+697370 AP Computer Sci (HP) c 5.00] 697145 AP Statistics (HP) 08/15/16 06/02/17
"697120 Hon Pre-Calculus (HP) B- 5.00] 737700 Physics (P) Osis 06/02/17
TERM: GPA 3.67 Credits 30.00] 748320 AP Human Geo (HP) OB15/16 06/02/17
CUMULATIVE: GPA 315 Credits 100.00] 748260 Economics (P) 12/2316 06/02/17
DOOOOOOOOF Grd 10 HS Sen 2 06/15
"737660 Chemistry (P) B+ 5.00
748450 Wrid Hist 2-3 (P) 8 5.00
"853655 Sophomore English (P) B+ 5.00
665120 Spanish Il (P} A 5.00
*+697370 AP Computer Sci (HP} B 5.00
*697120 Hon Pre-Calculus (HP) c- 5.00
GPA 3.17 Credits 30.00
GPA 3.15 Credits 130.00
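The output above contains a number of misreads (for example `Spanish Ili` and `Biclogy`). In the cell above OpenCV is only used to load the image, but it can also clean the image up before OCR. The following is a minimal sketch of one common approach, grayscale conversion followed by Otsu thresholding, with Tesseract's page segmentation mode set explicitly. It has not been run against these transcripts, so whether it improves the result here is an open question. `build_config` and `ocr_with_preprocessing` are hypothetical names, and the choice of `--psm 6` (treat the page as a single uniform block of text) is an assumption.

```python
def build_config(psm=6, oem=3):
    """Return a Tesseract option string, e.g. '--oem 3 --psm 6'."""
    return f"--oem {oem} --psm {psm}"


def ocr_with_preprocessing(image_path, psm=6):
    """Binarize the image with Otsu's threshold, then run Tesseract on it."""
    # Imported here so build_config() can be used without the OCR stack installed.
    import cv2
    import pytesseract

    img = cv2.imread(image_path)
    if img is None:  # cv2.imread returns None instead of raising on a bad path
        raise FileNotFoundError(image_path)

    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # With THRESH_OTSU the threshold is chosen automatically; the 0 is ignored.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, config=build_config(psm))
```

On Windows, `pytesseract.pytesseract.tesseract_cmd` still needs to be set as in the cell above before calling `ocr_with_preprocessing('./images/sample_1.jpg')`.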

### Transcript Sample #2

![Sample%20Transcript%202.jpg](attachment:Sample%20Transcript%202.jpg)

In [7]:
image_path = './images/sample_2.jpg'

img = cv2.imread(image_path)
text = pytesseract.image_to_string(img)

# Append the recognized text to output2.txt and close the file afterwards
with open("output2.txt", "a") as f:
    print(text, file=f)

### Output:

District Name: [eee eee eee ee eee) POQOOOKXXY)
(aes High School Transcript Student Number: gym Grade: 12
D000 GOO OOOO OOOOOOOOOOOOOOO™DKDET Generated EaEeeererererererereres Page 1 of 1

Student Information XK KKKKK KKK KOKO Credit Summary
Course Mark Weight Credit |Curriculum Program: High School
Student Number: (ae Grade: 12
Birthdate: POO) Gender: M 2016-2017 Grade 10 Term 1 High School Attempted Earned Required Remaining}
State ID: POoooeg 7060 A Choir A 5.0000 5 [English 30.000 30,000 40.000 10.000
Counselor: POOOOO™XY 3620 Chemistry H A 5.0000 5 ELD 0.000 0.000 0.000 0.000
Place of Birth: United States 4440 Chinese 4H A 5.0000 5 —JAgebra 10.000 10.000 10.000 0.000
Diploma Date: 4850 Law B 5.0000 5 _~——|Geometry 20.000 20.000 10.000 0.000
2400 Math Analysis B 5.0000 5 _—|Life Science 20.000 +~—«- 20.000 10.000 0.000
GPA Summary 1170 World Studies B 5.0000 5 _| Physical
1770 World Studies SS A 5.0000 5 _—‘|Science 10.000 10.000 10.000 0.000
Cumulative GPA (Unveighte) 3.400 Se NERD UStisery” 40.000 10000 10.000 0.000
2016-2017 Grade 10 Term 2 Economics 0.000 0.000 5.000 5.000
CA Cal Grant GPA 3.423 7060 A Choir A 5.0000 5 —_ Jus Government0.000 0.000 5.000 5.000
3620 Chemistry H B 5.0000 5 Physical
Unwghtd 10-12 A-G GPA 3.375 4440 Chinese 4H B 5.0000 5 __—_|Education 10.000 10.000 20.000 10.000
4850 Law B 5.0000 5 ~——|World
2400 Math Analysis B 5.0000 5_—|Language** 30.000 += 30.000 10.000 0.000
Unwalnid. 2:12. Ae1 GRA 8.824 1170 World Studies B 5.0000 5__—s Fine Arts** 20.000 +=. 20.000 10.000 0.000
1770 World Studies SS A 5.0000 5 Applied
Enrollment Summary Credit: 35.000 U/W GPA: 3.286 [Academics** 30.000 30.000 10.000 0.000
Electives 0.000 0.000 60.000 _ 0.000
Starvend Date Grade Schoo! 2017-2018 Grade 11 Term 1 Total 200.000 200.000 220.000 30.000
08/17/2015-06/02/2016 09 A ARAAOTOZOES 1130 Amer Lit/Writ B 5.0000 5
po] 3120 AP Biology A 5.0000 5
08/15/2016-06/01/2017 10 ,_eerererers 4450 AP Chinese Lng/Cul ‘A, (510000 5 Comments
08/14/2017-0531/2018 11 (eee 1750, APUS History B 5.0000 5 | Unofficial transcript
DOO 2420 Pre-Calculus H B 5.0000 5
08/20/2018- 12 (eee 4870 Princ of Mktng A 5.0000 5
ree Credit: 30.000 UAW GPA: 3.500
2017-2018 Grade 11 Term 2
POOOOOOOOOOOOOOOODX 1130 Amer LitWrit B 5.0000 5
Course Mark Weight Credit 3120 AP Biology A 5.0000 5
2015-2016 Grade 09 Term 1 saa x Osten” Cul . een
2320 Algebra 2/Trig Cc 5.0000 5 :
7030 B Choir K 5.0000 5 2420 Pre-Calculus H B 5.0000 5
3110 Biology A 5.0000 5 4870 Princ of Mktng A 5.0000 5
. Credit: 30.000 UAW GPA: 3.333
4430 Chinese 3 A 5.0000 5
1010 LitWrit B 5.0000 5
2510 PEQ A 5.0000 5 Work In Progress
4580 Princ of Business A 5.0000 5 2430 AP Calculus AB 5.000
Credit: 35.000 UW GPA: 3.571 1875 AP Macroeconomics 5.000
2015-2016 Grade 09 Term 2 3750 AP Physics 1 5.000
2320 Algebra 2/Trig Cc 5.0000 5 4860 Money and Banking 5.000
7030 B Choir A 5.0000 5 1300 Myth/Folk/Writ 5.000
3110 Biology B 5.0000 5 2740 PE Wt Training 5.000
4430 Chinese 3 B 5.0000 5
1010 LitWrit B 5.0000 5
2510 PE B 5.0000 5
4580 Princ of Business A 5.0000 5

Credit: 35.000 U/W GPA: 3.143
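Once the transcript is available as a text string, individual course rows can be pulled out with a regular expression. The sketch below is illustrative only: the pattern is a hypothetical one tuned to rows shaped like `3620 Chemistry H A 5.0000 5` in the output above, and rows whose mark was misread by OCR (such as `Cc`) are silently skipped.

```python
import re

# Hypothetical pattern for rows like "3620 Chemistry H A 5.0000 5":
# 4-digit course code, course name, letter mark, weight, credit.
COURSE_ROW = re.compile(
    r"(?<![\d.,])(\d{4}) (.+?) ([A-F][+-]?) (\d+\.\d{4}) (\d+)\b"
)


def extract_courses(ocr_text):
    """Return (code, name, mark, weight, credit) tuples found in OCR text."""
    return [match.groups() for match in COURSE_ROW.finditer(ocr_text)]


# Three lines copied from the Sample #2 output above.
sample = """\
State ID: POoooeg 7060 A Choir A 5.0000 5 [English 30.000 30,000 40.000 10.000
3110 Biology A 5.0000 5 4870 Princ of Mktng A 5.0000 5
2320 Algebra 2/Trig Cc 5.0000 5 :
"""

for row in extract_courses(sample):
    print(row)
```

The third sample line yields no row because its mark was recognized as `Cc`, which shows why the extracted rows should be checked against the original image before they are relied on.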