![head.png](https://github.com/iwh-halle/FinancialDataAnalytics/blob/master/figures/head.jpg?raw=1)

# Financial Data Analytics in Python

**Prof. Dr. Fabian Woebbeking**<br>
Assistant Professor of Financial Economics

IWH - Leibniz Institute for Economic Research<br>
MLU - Martin Luther University Halle-Wittenberg

fabian.woebbeking@iwh-halle.de

# Homework: Natural Language Processing (NLP)

You will need a Git/GitHub repository to submit your course deliverables. Consult [**slides.ipynb**](https://github.com/iwh-halle/FinancialDataAnalytics) for help with the tasks below! If you need further assistance, do not hesitate to open a Q&A at https://github.com/iwh-halle/FinancialDataAnalytics/discussions

### Task: Sourcing

The first stage involves sourcing and reading the textual content of scientific papers. You can find an example PDF file in ``../lit/nonanswers.pdf``. Please [download](https://scholar.google.de/) and analyze at least one additional paper of your choice (make sure to commit the paper to your repository).

Use an appropriate PDF reading library or tool to programmatically extract the text. An example is shown below; however, you are free to use any Python library you like.

In [8]:
# Step 1: Install pdfminer.six if you haven't already
# You can install it using conda or pip, see 
  # https://anaconda.org/conda-forge/pdfminer.six
  # https://pypi.org/project/pdfminer.six/

# Step 2: Import the required module
from pdfminer.high_level import extract_text

# Step 3: Extract text from PDF file
extracted_text = extract_text('../lit/nonanswers.pdf')
print(extracted_text[0:80])

“Let me get back to you” –
A machine learning approach to measuring
non-answers



In [4]:
from pdfminer.high_level import extract_text
extracted_text = extract_text('../Research_Paper_on_Basic_of_Artificial_Ne.pdf')
print(extracted_text[0:800])  # Print the first 800 characters to check the extraction


International Journal on Recent and Innovation Trends in Computing and Communication         
           ISSN: 2321-8169 
Volume: 2 Issue: 1                                                                                                                                                                                           96 – 100 
_______________________________________________________________________________________ 

Research Paper on Basic of Artificial Neural Network 

Ms. Sonali. B. Maind 
Department of Information Technology 
Datta Meghe Institute of Engineering, Technology & Research, Sawangi (M), Wardha  
sonali.maind@gmail.com 

Ms. Priyanka Wankar 
Department of Computer Science and Engineering 
Datta Meghe Institute of Engineering, Technology & Research, Sawangi (M), Wardha 


### Task: Pre-processing

Pre-processing is a critical step aimed at cleaning and preparing the text data for analysis. Steps that you should consider:

* Removing punctuation, numbers and special characters using regular expressions.
* Converting all the text to a uniform case (usually lower case) to ensure that the analysis is not case-sensitive.
* Stop word removal, i.e. eliminating commonly used words (e.g., 'and', 'the', 'is') that do not contribute significantly to the overall meaning and can skew the analysis.
* Optional: other pre-processing steps such as stemming and lemmatization, depending on the specific requirements and goals of the analysis.

In [11]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from pdfminer.high_level import extract_text

# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize stop words and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

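def remove_punctuation_numbers(text):
    """Delete every character that is not a letter or whitespace."""
    # Characters are removed, not replaced by a space, so 'a.b' becomes 'ab'
    return re.sub(r'[^a-zA-Z\s]', '', text)

def to_lowercase(text):
    """Convert the text to lower case."""
    return text.lower()

def remove_stopwords(text):
    """Drop English stop words and any non-alphabetic tokens."""
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return ' '.join(tokens)

def lemmatize_text(text):
    """Reduce each token to its WordNet lemma (tokens are treated as nouns by default)."""
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(lemmatized_tokens)

def preprocess_text(text):
    """Run the full pipeline: strip non-letters, lower-case, remove stop words, lemmatize."""
    text = remove_punctuation_numbers(text)
    text = to_lowercase(text)
    text = remove_stopwords(text)
    text = lemmatize_text(text)
    return text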

# Step 1: Specify the path to your PDF file
pdf_file_path = '../Research_Paper_on_Basic_of_Artificial_Ne.pdf'

# Step 2: Extract text from the PDF file
extracted_text = extract_text(pdf_file_path)

# Step 3: Preprocess the extracted text
preprocessed_text = preprocess_text(extracted_text)

print("Original Text:")
print(extracted_text[0:500])  # Print the first 500 characters of the original text
print("\nPreprocessed Text:")
print(preprocessed_text[0:500])  # Print the first 500 characters of the preprocessed text


[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original Text:
International Journal on Recent and Innovation Trends in Computing and Communication         
           ISSN: 2321-8169 
Volume: 2 Issue: 1                                                                                                                                                                                           96 – 100 
_______________________________________________________________________________________ 

Research Paper on Basic of Artificial Neural Network 

Ms. Sonali. B. Mai

Preprocessed Text:
international journal recent innovation trend computing communication issn volume issue research paper basic artificial neural network m sonali b maind department information technology datta meghe institute engineering technology research sawangi wardha sonalimaindgmailcom m priyanka wankar department computer science engineering datta meghe institute engineering technology research sawangi wardha priyankawankargmailcom abstractan artificial neural network an
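
Tokens such as `sonalimaindgmailcom` and `abstractan` in the output above are a side effect of deleting non-letter characters outright: the words on either side of a dot, `@`, or dash are fused together. A minimal sketch of one possible alternative, assuming it is acceptable to split words at those characters, replaces each removed character with a space and then collapses repeated whitespace. The function name `strip_non_letters` and the second test string are hypothetical and not part of the notebook above.

```python
import re

def strip_non_letters(text):
    """Replace every non-letter, non-whitespace character with a space.

    Hypothetical helper (not from the notebook): substituting a space instead
    of the empty string keeps neighbouring words apart.
    """
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Collapse the runs of whitespace that the substitution leaves behind
    return re.sub(r'\s+', ' ', text).strip()

print(strip_non_letters('sonali.maind@gmail.com'))
print(strip_non_letters('Abstract-An artificial neural network'))
```

With this variant the e-mail address is split into separate tokens rather than merged into one; whether that is preferable depends on the analysis.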

### Task: Analysis

The final stage is the analysis of the pre-processed text, in order to extract meaningful context. This may involve:

* Frequency Analysis: Determining the most commonly occurring words or phrases, which can provide initial insights into the primary focus areas of the papers. Consider, e.g., a word cloud as a visualization.
* Contextual Analysis: Using more advanced NLP techniques such as Word Embedding or Topic Modeling to understand the context of the papers.
* Sentiment analysis: We would expect scientific papers to be written in a neutral tone. Can you confirm this?
* Summarization: Employing algorithms to generate concise summaries of the papers, capturing the key points and findings.

Pick any method that you like (you are allowed to use ChatGPT's API as well).
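
As a starting point for the frequency analysis, word counts can be computed with the standard library alone. The sketch below assumes the input is already pre-processed, space-separated text (like `preprocessed_text` above). The helper name `top_words` is hypothetical, and the sample string is a short excerpt of the pre-processed output shown earlier.

```python
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent whitespace-separated tokens with their counts."""
    return Counter(text.split()).most_common(n)

# Short excerpt of the pre-processed output; in the notebook, pass preprocessed_text instead
sample = ("research paper basic artificial neural network m sonali b maind "
          "department information technology datta meghe institute engineering "
          "technology research")

for word, count in top_words(sample, n=5):
    print(f"{word}: {count}")
```

The resulting list of `(word, count)` pairs can be inspected directly or used as the basis for a bar chart, as a complement to the word cloud.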

In [6]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud


def plot_wordcloud(text):
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

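# Generate word cloud for the preprocessed text.
# Note: run the pre-processing cell first; otherwise preprocessed_text
# is undefined and this call raises a NameError.
plot_wordcloud(preprocessed_text)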


NameError: name 'preprocessed_text' is not defined