<a href="https://colab.research.google.com/github/idv713/jupyter-exploration/blob/main/Manual_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")

# Process a text string
text = "Spacy is a great library for NLP!"
doc = nlp(text)

# Collect the token strings
tokens = [token.text for token in doc]
# Iterate through the tokens to print details of each token.
# Use token.text (attribute access); "token,text" would build a tuple instead.
for token in doc:
    print(f"text: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}")

text: Spacy, Lemma: spacy, POS: NOUN
text: is, Lemma: be, POS: AUX
text: a, Lemma: a, POS: DET
text: great, Lemma: great, POS: ADJ
text: library, Lemma: library, POS: NOUN
text: for, Lemma: for, POS: ADP
text: NLP, Lemma: NLP, POS: PROPN
text: !, Lemma: !, POS: PUNCT


In [7]:
import corpus

ModuleNotFoundError: No module named 'corpus'

In [7]:
!pip install nltk
import nltk
nltk.download('corpus') # Downloads the corpus



[nltk_data] Error loading corpus: Package 'corpus' not found in index


False

# Task
Tell me about this dataset.

Here is all the data you need:
"corpus.txt"

## Data loading

### Subtask:
Load the data from the "corpus.txt" file into a string variable.


**Reasoning**:
I need to read the content of "corpus.txt" into a string variable named `corpus_text`.



In [9]:
try:
    with open('corpus.txt', 'r') as file:
        corpus_text = file.read()
except FileNotFoundError:
    corpus_text = None
    print("Error: corpus.txt not found.")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 469: invalid start byte

**Reasoning**:
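The previous attempt failed with a `UnicodeDecodeError`: byte 0xa0 is not valid UTF-8 at that position, so the file is not UTF-8 encoded. I'll reopen it as 'latin-1' (another name for 'iso-8859-1'), a single-byte encoding that maps every byte value to a character, so decoding cannot fail. In Latin-1, 0xa0 is a non-breaking space.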



In [11]:
try:
    # 'latin-1' and 'iso-8859-1' are the same codec, and it maps every byte
    # to a character, so this read cannot raise UnicodeDecodeError.
    with open('corpus.txt', 'r', encoding='latin-1') as file:
        corpus_text = file.read()
except FileNotFoundError:
    corpus_text = None
    print("Error: corpus.txt not found.")
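The encoding fallback described above can also be written as a small helper that works on raw bytes. This is a minimal sketch, not part of the original notebook: the function name `decode_with_fallback` and the candidate list (including `cp1252`) are assumptions chosen for illustration.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_name) for the first encoding that decodes raw.

    The candidate encodings are an assumption; adjust them to the data source.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Byte 0xa0 is invalid as a UTF-8 start byte, so the helper moves on to cp1252.
sample_text, sample_encoding = decode_with_fallback(b"sweet\xa016")
print(repr(sample_text), sample_encoding)
```

For a file, the bytes would come from `open('corpus.txt', 'rb').read()`. Because `latin-1` accepts any byte sequence, it only makes sense as the last candidate.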

## Data exploration

### Subtask:
Explore the loaded text data to understand its basic characteristics.


**Reasoning**:
Explore the loaded text data by calculating the number of characters, words, and sentences, printing the first 200 characters, and analyzing character frequencies.



In [12]:
import re
from collections import Counter

if corpus_text:
    num_chars = len(corpus_text)
    print(f"Number of characters: {num_chars}")

    words = re.findall(r'\b\w+\b', corpus_text.lower())
    num_words = len(words)
    print(f"Number of words: {num_words}")

    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s', corpus_text)
    num_sentences = len(sentences)
    print(f"Number of sentences: {num_sentences}")

    print(f"First 200 characters:\n{corpus_text[:200]}")

    char_counts = Counter(corpus_text)
    top_10_chars = char_counts.most_common(10)
    print("Top 10 most frequent characters:")
    for char, count in top_10_chars:
        print(f"'{char}': {count}")
else:
    print("The text data could not be loaded, so the analysis cannot be performed.")

Number of characters: 3889
Number of words: 583
Number of sentences: 1
First 200 characters:
march madness predictions ncaa bracket expert picks spread odds thursday sweet 
best bets duke-arizona florida maryland alabama byu 
might bemoan lack cinderella second weekend ncaa tournament consola
Top 10 most frequent characters:
' ': 561
'e': 399
'a': 326
'n': 243
's': 233
't': 226
'i': 213
'r': 203
'o': 181
'l': 162


## Data wrangling

### Subtask:
Clean the loaded text data.


**Reasoning**:
Clean the text data by removing special characters, handling newline characters, converting to lowercase, and removing extra whitespaces.



In [13]:
import re

if corpus_text:
    # Remove special characters, keeping letters, numbers, and common punctuation
    corpus_text = re.sub(r'[^\w\s.,!?;:-]', '', corpus_text)

    # Replace newline characters with spaces
    corpus_text = corpus_text.replace('\n', ' ').replace('\r', '')

    # Convert to lowercase
    corpus_text = corpus_text.lower()

    # Remove extra whitespaces
    corpus_text = re.sub(r'\s+', ' ', corpus_text).strip()
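
The cleaning steps above can be checked on a small sample before running them on the whole file. The sketch below wraps the same four steps in a function; the name `clean_text` and the sample string are hypothetical, introduced only for illustration.

```python
import re


def clean_text(text):
    """Apply the same cleaning steps as the cell above to any string."""
    # Keep letters, digits, underscore, whitespace, and common punctuation
    text = re.sub(r'[^\w\s.,!?;:-]', '', text)
    # Replace newlines with spaces and drop carriage returns
    text = text.replace('\n', ' ').replace('\r', '')
    # Convert to lowercase
    text = text.lower()
    # Collapse runs of whitespace (including non-breaking spaces) and trim
    return re.sub(r'\s+', ' ', text).strip()


print(clean_text("Hello,\n  World!\xa0 #1"))
```

Note that `\s` matches the non-breaking space (0xa0) that the Latin-1 decode produces, so the final step also normalizes those characters to ordinary spaces.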

## Data analysis

### Subtask:
Analyze the cleaned text data to gain insights.


**Reasoning**:
Calculate word frequencies, analyze sentence lengths, identify frequent words excluding stop words, and report other relevant statistics.



In [14]:
import re
from collections import Counter

if corpus_text:
    # 1. Word Frequencies
    words = re.findall(r'\b\w+\b', corpus_text)
    word_counts = Counter(words)
    print("Word Frequencies:")
    for word, count in word_counts.most_common():
        print(f"{word}: {count}")

    # 2. Sentence Length Distribution
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s', corpus_text)
    sentence_lengths = [len(sentence.split()) for sentence in sentences if sentence.strip()]
    avg_sentence_length = sum(sentence_lengths) / len(sentence_lengths) if sentence_lengths else 0
    min_sentence_length = min(sentence_lengths) if sentence_lengths else 0
    max_sentence_length = max(sentence_lengths) if sentence_lengths else 0
    print(f"\nAverage Sentence Length: {avg_sentence_length}")
    print(f"Minimum Sentence Length: {min_sentence_length}")
    print(f"Maximum Sentence Length: {max_sentence_length}")

    # 3. Top 10 Frequent Words (excluding stop words)
    stop_words = {
        "the", "a", "an", "is", "are", "of", "in", "to", "and", "that", "it",
        "for", "with", "on", "as", "this", "be", "at", "by", "from", "or",
        "i", "you", "he", "she", "we", "they", "me", "him", "her", "us", "them",
    }
    filtered_words = [word for word in words if word.lower() not in stop_words]
    filtered_word_counts = Counter(filtered_words)
    top_10_words = filtered_word_counts.most_common(10)
    print("\nTop 10 Most Frequent Words (excluding stop words):")
    for word, count in top_10_words:
        print(f"{word}: {count}")

    # 4. Other Observations
    print("\nOther Observations:")
    print("The text appears to be about NCAA March Madness predictions.")
else:
    print("The text data could not be loaded, so the analysis cannot be performed.")

Word Frequencies:
tournament: 10
byu: 8
seed: 8
march: 7
regional: 7
madness: 6
duke: 6
florida: 6
game: 6
arkansas: 6
team: 6
arizona: 5
maryland: 5
back: 5
ncaa: 4
best: 4
alabama: 4
will: 4
games: 4
night: 4
coming: 4
big: 4
left: 4
vs: 4
live: 4
pick: 4
thursday: 3
sweet: 3
playing: 3
east: 3
west: 3
one: 3
after: 3
first: 3
get: 3
texas: 3
tech: 3
season: 3
win: 3
can: 3
going: 3
picks: 2
spread: 2
might: 2
cinderella: 2
weekend: 2
eight: 2
programs: 2
five: 2
have: 2
combined: 2
span: 2
gets: 2
started: 2
newark: 2
true: 2
thanks: 2
late: 2
star: 2
freshman: 2
derik: 2
queen: 2
two: 2
history: 2
blue: 2
devils: 2
well: 2
final: 2
coach: 2
john: 2
opening: 2
looks: 2
every: 2
points: 2
bet: 2
cbs: 2
since: 2
feb: 2
oddsmakers: 2
size: 2
efficient: 2
read: 2
just: 2
performance: 2
demin: 2
matchup: 2
tbs: 2
giving: 2
minutes: 2
take: 2
hit: 2
able: 2
plays: 2
flagg: 2
with: 2
shooting: 2
predictions: 1
bracket: 1
expert: 1
odds: 1
bets: 1
bemoan: 1
lack: 1
second: 1
consolation: 1


## Summary:

### 1. Q&A

The provided data analysis task doesn't explicitly pose questions. However, the analysis implicitly seeks to understand the content and characteristics of the "corpus.txt" file.  The analysis answers these implicit questions:

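* **What is the basic structure of the text?**  The file contains 3889 characters and 583 words by the regex word count. The sentence splitter returned the whole text as a single segment of 581 whitespace-separated tokens; the two word counts differ because they were produced by different counting methods.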
* **What are the most frequent words?** The most frequent words, excluding common stop words, relate to NCAA March Madness (e.g., tournament, byu, seed, march, regional, madness).
* **What is the general topic of the text?** The text appears to be about NCAA March Madness predictions.


### 2. Data Analysis Key Findings

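* **Single, Extremely Long Sentence:** The sentence splitter returned the whole text as one 581-word segment. The first 200 characters are lowercase and contain no sentence-ending punctuation, so the regex had nothing to split on; the file appears to have been preprocessed before this analysis.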
* **Dominant Topic:** The most frequent words (excluding stop words) strongly indicate the text focuses on NCAA March Madness predictions, with mentions of specific teams (BYU, Duke, Florida, Arkansas) and tournament concepts (seed, regional).
* **Average Sentence Length:**  The average sentence length is 581 words due to the single sentence.  Minimum and maximum sentence lengths are both 581 words.


### 3. Insights or Next Steps

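* **Review Sentence Segmentation:** A single 581-word segment shows that punctuation-based splitting does not suit this file. Because the sample contains no sentence-ending punctuation, changing the regex alone is unlikely to help; consider splitting on line breaks in the raw text, or running a more robust sentence boundary detector on the original, unprocessed article if it is available.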
* **Further Topic Modeling:**  While the analysis suggests a focus on March Madness, more sophisticated topic modeling techniques could reveal nuances or additional themes within the text.
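
One way to act on the segmentation point is to fall back to line breaks when the text has no sentence-ending punctuation. The following is a minimal sketch under that assumption; the function name `segment_text` is hypothetical, and it must run on the raw text, before the cleaning step replaces newlines with spaces.

```python
import re

# Same punctuation-based pattern used in the analysis cells
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s')


def segment_text(text):
    """Split on sentence punctuation when present, otherwise on line breaks."""
    if re.search(r'[.?!]\s', text):
        parts = SENTENCE_BOUNDARY.split(text)
    else:
        # No punctuation to split on: treat each non-empty line as a segment
        parts = text.splitlines()
    return [part.strip() for part in parts if part.strip()]


sample = "march madness predictions \nbest bets duke-arizona \nmight bemoan lack"
print(segment_text(sample))
```

Whether a line corresponds to a sentence, a headline, or a paragraph depends on how the corpus was prepared, so the resulting segment lengths should be read with that in mind.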


In [17]:
!jupyter nbconvert --to html Manual_NLP.ipynb

[NbConvertApp] Converting notebook Manual_NLP.ipynb to html
[NbConvertApp] ERROR | Notebook JSON is invalid: Additional properties are not allowed ('errorDetails' was unexpected)

Failed validating 'additionalProperties' in error:

On instance['cells'][2]['outputs'][0]:
{'ename': 'ModuleNotFoundError',
 'errorDetails': {'actions': [{'action': 'open_url',
                               'actionText': 'Open Examples',
                               'url': '/notebooks/snippets/importing_libraries.ipynb'}]},
 'evalue': "No module named 'corpus'",
 'output_type': 'error',
 'traceback': ['\x1b[0;31m---------------------------------------------------------...',
               '\x1b[0;31mModuleNotFoundError\x1b[0m                       '
               'Traceback (...',
               '\x1b[0;32m<ipython-input-7-9f104b9f6dd0>\x1b[0m in '
               '\x1b[0;36m<cell line: ...',
               '\x1b[0;31mModuleNotFoundError\x1b[0m: No module named '
               "'corpus'",
               '