
**In this codebook, I will convert selected PDF files from the *IPCC* archive linked [here](https://archive.ipcc.ch/) into text. The archive covers the first five assessment cycles. The [sixth](https://www.ipcc.ch/assessment-report/ar6/) is being drafted, and I may not include it in the corpus (more on this in the project write-up). I plan to use this codebook as a template for reading reports, meaning I will make one copy of it for each report.**


## Conversion
In this section I convert the PDF report to a machine-readable string. Most of this code comes from external libraries, so I include links for reference. The code's final output is a string for subsequent cleaning and analysis.

In [None]:
# Opens the PDF with PyPDF2 as a quick check that the file can be read.
# (PyPDF2 is a separate library from pdfminer, which does the conversion below.)

# install PyPDF2
! pip install PyPDF2

# import the required module
import PyPDF2

# open the PDF in binary mode
file = open('', 'rb') # put file here

# create a PDF reader object (PdfFileReader was removed in PyPDF2 3.0; PdfReader replaces it)
fileReader = PyPDF2.PdfReader(file)

# print the number of pages in the PDF (numPages was replaced by len(reader.pages))
print(len(fileReader.pages))

In [None]:
! pip install pdfminer.six # maintained Python 3 fork of PDFMiner; the import name is still pdfminer
import pdfminer

Got the following code from: https://stackoverflow.com/questions/5725278/how-do-i-use-pdfminer-as-a-library/8325135#8325135. 

In [None]:
from io import StringIO

Learned about this from: https://stackoverflow.com/questions/28200366/python-3-x-importerror-no-module-named-cstringio.

In [None]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    """Extract the text of the PDF at `path` and return it as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams(char_margin=20)  # larger margin: characters farther apart still join the same line
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages
    with open(path, 'rb') as fp:  # use the path argument rather than a hard-coded file
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()  # named text to avoid shadowing the built-in str
    retstr.close()
    return text

In [None]:
# here is where we get a string from the PDF; it still needs to be exported as txt
pdf_path = '' # put file path here (the function takes a path, not the PyPDF2 reader object)
report_text = convert_pdf_to_txt(pdf_path) # report_text is a placeholder name; rename for each report
print(type(report_text)) # check that it comes out as a string. this works! just takes a little bit of time
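
The comment above mentions exporting the string as a txt file. A minimal sketch of that step, using only the standard library, might look like the following. The names `sample_text` and `out_path` and the file name `report_text.txt` are placeholders for illustration, not part of the original notebook; in practice the string would be the output of `convert_pdf_to_txt`.

```python
from pathlib import Path

# Stand-in for the string returned by convert_pdf_to_txt (made-up content)
sample_text = "Example extracted text.\nSecond line."

# Hypothetical output location; rename for each report
out_path = Path("report_text.txt")

# Write the string to disk as UTF-8
out_path.write_text(sample_text, encoding="utf-8")

# Read it back to confirm the round trip preserved the text
reloaded = out_path.read_text(encoding="utf-8")
print(len(reloaded), "characters written to", out_path)
```

Saving the converted text once means the slow PDF conversion does not have to be rerun every time the cleaning or analysis cells change.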

## Cleaning
In this section, I lowercase the text, remove stopwords and punctuation, lemmatize it, and prepare the cleaned files for analysis.

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab') # newer NLTK releases load the tokenizer models from this resource
nltk.download('wordnet')

In [None]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords

In [None]:
import string # needed by clean_text; imported here so the function does not depend on a later cell

# Remove stopwords function.

def remove_Stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text.lower())
    sentence = [w for w in words if w not in stop_words]
    return " ".join(sentence)

# Lemmatize function.
def lemmatize_text(text):
    wordlist = []
    lemmatizer = WordNetLemmatizer()
    sentences = sent_tokenize(text)
    for sentence in sentences:
        words = word_tokenize(sentence)
        for word in words:
            wordlist.append(lemmatizer.lemmatize(word))
    return ' '.join(wordlist)

# Cleaning text function: strips punctuation, collapses whitespace, lowercases.
def clean_text(text):
    delete_dict = {sp_character: '' for sp_character in string.punctuation}
    table = str.maketrans(delete_dict)
    text = text.translate(table)
    text = ' '.join(text.split())

    return text.lower()
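
To see what the punctuation table in `clean_text` does, here is a small standalone sketch using only the standard library. The sample sentence is made up for illustration and is not taken from any report.

```python
import string

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)

# Made-up sentence, not taken from an IPCC report
sample = "Warming of 1.5 degrees, relative to pre-industrial levels, is 'likely'."

# Strip punctuation, collapse runs of whitespace, then lowercase
stripped = sample.translate(table)
normalized = ' '.join(stripped.split()).lower()

print(normalized)
```

One thing to watch: removing punctuation this way merges decimals and hyphenated terms ("1.5" becomes "15" and "pre-industrial" becomes "preindustrial"), which may matter for the figures quoted in the reports.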

## Analysis
In this section, I will analyze the cleaned files with basic word counts to highlight significant terms in each report. I will also create a few WordCloud visualizations for each report: one with collocations enabled, and two or three that remove disproportionately represented terms and bigrams.

In [None]:
from collections import Counter

cleaned_text = '' # put the cleaned report string here
words = cleaned_text.split() # from week 4 codebook

types = Counter(words)

print(types)
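
As a rough sketch of the counting described above, the following uses `Counter` for single terms and bigrams, then drops terms judged to be over-represented. The sample string and the `overrepresented` set are hypothetical stand-ins, not results from any report.

```python
from collections import Counter

# Toy stand-in for a cleaned report string (not taken from any IPCC report)
sample = "climate change risk climate change impact sea level rise climate risk"
tokens = sample.split()

# Count single terms
unigrams = Counter(tokens)

# Count adjacent word pairs by zipping the token list with itself shifted by one
bigrams = Counter(zip(tokens, tokens[1:]))

# Hypothetical set of terms judged to be disproportionately represented
overrepresented = {"climate", "change"}
filtered = Counter(t for t in tokens if t not in overrepresented)

print(unigrams.most_common(3))
print(bigrams.most_common(2))
print(filtered.most_common(3))
```

Comparing the filtered counts with the full counts shows which terms surface once the dominant vocabulary is set aside, and the same comparison can guide which terms to pass to the word clouds as stopwords.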

In [None]:
import numpy as np # from bibliometric analysis codebook
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
from wordcloud import WordCloud

In [None]:
from nltk import FreqDist # from week 4 codebook
freq = FreqDist(words) # assumes words, the token list from the Counter cell above
print(freq.most_common(100))


In [None]:
df = pd.DataFrame(freq.most_common(20), columns=['word', 'count']) # from week 4 codebook; dataframe of the most frequent words (in the cleaned files)

In [None]:
# 'word' is the column named in the cell above; still need to get reports into a dataframe of some kind
# each word appears once in the joined string, so word sizes here will not reflect the counts
wordcloud = WordCloud(
    background_color=(246,235,189), max_words=10, width=1600, height=800,
    random_state=30, stopwords=[],
).generate(' '.join(df['word'].tolist()))
plt.figure(figsize=(20,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()