Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial commit to explain how to handle PDF date #112

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file added dataset/complaints.pdf
Binary file not shown.
425 changes: 425 additions & 0 deletions examples/reading_pdf_data.ipynb

Large diffs are not rendered by default.

35 changes: 35 additions & 0 deletions website/blog/2020-07-25-read-pdf-data-into-texthero.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
---
title: Read Text Data From PDF and use it in Texthero
author: Selim Alawwa
unlisted: True
---

## Introduction
PDF files are widely used due to its broad compatibility however handling or editing its data is a rather uneasy task. Since many reports and documents that contain valuable information are in PDF and analysing this data can be provide high value to users, we thought we can show how we can read PDF data and use them in texthero.

## Working with PDF
Python pandas does not support reading PDF file by default so we will explore the various options and python libraries that enable us to read PDF files and load them into python data frames

## Example using Tabula
tabula-py is a simple Python wrapper of tabula-java, which can read tables from PDF and convert them into pandas’s DataFrame. The library requires having Java 8+ and Python 3.5+.

```
import texthero as hero
try:
import tabula
except:
!pip install tabula-py
import tabula
```

read_pdf funtion returns an array of dataframes obtained from the pdf file.
```
dataframes = tabula.read_pdf("path/to/pdf/file", pages='all')
df = dataframes[0]
df.head()
```
Apply texthero clean method then visualize the a simple word cloud
```
df['text_clean'] = df['text'].pipe(hero.clean)
df['text_clean'].pipe(hero.wordcloud, max_words=50)
```