# GROUP FOUR
1. S084-01-2341/2021 PETER KARURI NDUNGU
2. S084-01-2342/2021 MAURINE CHEMUTAI
3. S084-01-2299/2021 SAMUEL MUTAHI
4. S084-01-2301/2021 WINROSE WANGUI
5. S084-01-2742/2021 AMOS KIPNGENO
6. S084-01-2318/2021 ERICK MUEMA

# SMA 4143 DAII – Text Analytics
1. **Define text analytics and discuss its importance in data science.**

   **Text analytics** is the process of extracting meaningful information from unstructured text data using computational techniques.

   **Importance in Data Science:**
   * *Insights Extraction:* Helps in understanding trends, patterns, and sentiments in large text datasets.
   * *Decision Making:* Supports better decisions by surfacing the themes and emerging issues present in textual data.
   * *Automation:* Enables automation of tasks like categorizing documents, detecting spam, and summarizing content, increasing efficiency and productivity.

2.  **Discuss the importance of text data pre-processing.**  
* ***Noise Reduction:*** Removes irrelevant information (e.g., punctuation, stop words), making the data cleaner.
* ***Normalization:*** Standardizes text data to ensure consistency, which is important for accurate analysis (e.g., converting all text to lowercase).
* ***Improves Model Performance:*** Properly pre-processed text data enhances the performance of machine learning models by ensuring that the data is in a suitable format.
* ***Feature Extraction:*** Facilitates the extraction of meaningful features from text, which can be used for various analytical purposes.
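
The noise-reduction point can be illustrated with a minimal sketch that uses only the Python standard library. The function name `clean_text` and the sample sentence are assumptions made for illustration; they are not part of the assignment data.

```python
import re
import string

def clean_text(text: str) -> str:
    """Strip URLs and ASCII punctuation, lowercase, and collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # noise reduction: drop URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = text.lower()  # normalization: one letter case
    return " ".join(text.split())  # collapse runs of whitespace

# Illustrative input, invented for this example
raw = "Check THIS out: https://example.com !!!  Great   product, great price."
print(clean_text(raw))  # check this out great product great price
```

Removing all punctuation is a blunt choice: it also turns "don't" into "dont", so a real pipeline may prefer to expand contractions first.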

**3. Explain the following techniques as used in text analytics and discuss their significance:**
     
***(a) Text Normalization:***

* It is the process of converting text into a standard, consistent format (e.g., lowercasing all characters).

*Significance:*   
* Ensures uniformity in text data by handling variations such as case sensitivity, punctuation, and contractions (e.g., "don't" to "do not").  
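
A minimal sketch of normalization, assuming a small hand-written contraction table. The `CONTRACTIONS` dictionary and the `normalize` function are hypothetical names chosen for illustration; a real project would use a much fuller table.

```python
import re

# Hypothetical, deliberately tiny contraction table (an assumption for this sketch)
CONTRACTIONS = {"don't": "do not", "won't": "will not", "can't": "cannot", "i'm": "i am"}

# One regex that matches any known contraction as a whole word
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b")

def normalize(text: str) -> str:
    """Lowercase, unify apostrophes, expand known contractions, collapse whitespace."""
    text = text.lower().replace("\u2019", "'")  # curly apostrophe -> straight
    text = _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0)], text)
    return " ".join(text.split())

print(normalize("I DON'T know, but   we WON'T stop"))
# i do not know, but we will not stop
```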

***(b) Tokenization:***  
* It is the process of splitting text into smaller units called tokens, such as words, numbers, or punctuation marks.  

*Significance:*  
* Fundamental for many text processing tasks, enabling the analysis of text at the word level.
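
A minimal sketch of a regex-based tokenizer, assuming only the standard library. The `tokenize` function is a hypothetical helper; it keeps contractions such as "couldn't" as one token, which is a different choice from NLTK's `word_tokenize`.

```python
import re

def tokenize(text: str) -> list[str]:
    """Return word tokens (with an optional internal apostrophe) and punctuation tokens."""
    # \w+(?:'\w+)? matches words such as "couldn't"; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(tokenize("Felix could run at 25 mph, couldn't he?"))
# ['Felix', 'could', 'run', 'at', '25', 'mph', ',', "couldn't", 'he', '?']
```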

***(c) Stop-word Removal:***
* Eliminates common words that carry little meaning (e.g., "and", "the").

*Significance:*  
 * Reduces dimensionality and focuses on more meaningful words.  
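
A minimal sketch of stop-word filtering. The `STOP_WORDS` set is a hypothetical mini list written for this example; libraries such as NLTK ship much fuller lists.

```python
# Hypothetical mini stop-word list (an assumption for illustration)
STOP_WORDS = {"a", "an", "and", "the", "in", "was", "for", "his", "he", "at", "over"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["he", "was", "famous", "for", "his", "speed", "and", "agility"]
print(remove_stop_words(tokens))  # ['famous', 'speed', 'agility']
```

Eight tokens shrink to three, which is the dimensionality reduction described above.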

***(d) Stemming:***  
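* Reduces words to a root form by heuristically stripping suffixes (e.g., "running" to "run"); the resulting stem is not always a dictionary word (e.g., "village" to "villag").  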

*Significance:*   
* Helps in standardizing words with similar meanings, improving the efficiency of text analysis.  
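
To show the idea of suffix stripping, here is a toy stemmer. It is a simplified sketch and not the Porter algorithm; the suffix list and the `simple_stem` name are assumptions made for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order

def simple_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run" (but keep "fall", "pass")
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

print([simple_stem(w) for w in ["running", "jumped", "obstacles", "fox"]])
# ['run', 'jump', 'obstacl', 'fox']
```

Such crude rules over-stem some words (for example, "speed" becomes "spe"), which is why established algorithms like Porter's add many extra conditions.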

***(e) Lemmatization:***  
* Converts words to their dictionary form (e.g., "better" to "good").

*Significance:*   
* Provides more accurate word representations compared to stemming, aiding in better text understanding.  
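
A minimal sketch of dictionary-based lemmatization. The `LEMMAS` table is a hypothetical miniature lookup keyed by word and part of speech; real lemmatizers rely on a full lexical database such as WordNet.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part_of_speech)
LEMMAS = {
    ("better", "adj"): "good",
    ("best", "adj"): "good",
    ("ran", "verb"): "run",
    ("running", "verb"): "run",
    ("mice", "noun"): "mouse",
}

def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form if known, otherwise the lowercased word unchanged."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)

print(lemmatize("better", "adj"))   # good
print(lemmatize("better", "noun"))  # better
print(lemmatize("Mice", "noun"))    # mouse
```

The part of speech matters: "better" maps to "good" only when treated as an adjective. NLTK's `WordNetLemmatizer.lemmatize` behaves similarly, defaulting to the noun reading unless a `pos` argument is given.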

***(f) Text Encoding:***  
 * Converts text into numerical representations (e.g., one-hot encoding, TF-IDF).

*Significance:*  
 * Necessary for machine learning models to process text data.
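
The two encodings named above can be sketched from scratch. This example assumes a tiny invented corpus and the plain `tf * log(N / df)` weighting; libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

# Invented, pre-tokenized toy corpus (an assumption for illustration)
docs = [["great", "movie"], ["bad", "movie"], ["great", "great", "acting"]]
vocab = sorted({w for d in docs for w in d})  # ['acting', 'bad', 'great', 'movie']

def one_hot(word):
    """Vector with a single 1 at the word's position in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

def tf_idf(doc):
    """Term frequency in this document times log inverse document frequency."""
    counts = Counter(doc)
    vec = []
    for w in vocab:
        tf = counts[w] / len(doc)
        df = sum(1 for d in docs if w in d)  # number of documents containing w
        vec.append(round(tf * math.log(len(docs) / df), 3))
    return vec

print(one_hot("great"))  # [0, 0, 1, 0]
print(tf_idf(docs[2]))   # [0.366, 0.0, 0.27, 0.0]
```

In the third document "acting" outweighs "great" even though "great" appears twice, because "acting" occurs in only one document.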

***(g) Vectorization and Embeddings:***  
* Transforms text into vector representations (e.g., Word2Vec, BERT embeddings).  

*Significance:*  
 * Captures semantic meaning and context of words, improving model performance.
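
How embeddings capture similarity can be sketched with cosine similarity. The three-dimensional vectors below are invented for illustration only; real Word2Vec or BERT embeddings are learned from data and have hundreds of dimensions.

```python
import math

# Hypothetical toy vectors, made up for this sketch (not real embeddings)
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_royal = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
sim_fruit = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["apple"])
print(sim_royal > sim_fruit)  # True
```

With these made-up values, "king" sits much closer to "queen" than to "apple", which is the kind of semantic relationship a trained embedding is meant to encode.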

***(h) Padding/Truncation:***  
* Adjusts text sequences to a uniform length by appending padding tokens to short sequences or truncating long ones.

*Significance:*   
* Ensures consistency in input lengths for models, particularly in deep learning.
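
A minimal sketch of padding and truncation on sequences of token ids. The function name `pad_or_truncate`, the pad value `0`, and the sample batch are assumptions for illustration; deep learning frameworks provide their own utilities for this.

```python
def pad_or_truncate(seq, max_len, pad_value=0):
    """Cut seq to max_len, or right-pad it with pad_value up to max_len."""
    return list(seq[:max_len]) + [pad_value] * max(0, max_len - len(seq))

# Invented batch of token-id sequences with different lengths
batch = [[4, 7, 2], [9, 1, 5, 3, 8, 6], [2]]
padded = [pad_or_truncate(s, max_len=4) for s in batch]
print(padded)  # [[4, 7, 2, 0], [9, 1, 5, 3], [2, 0, 0, 0]]
```

Every sequence now has length 4, so the batch can be stacked into a single rectangular array.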


**4. What is sentiment analysis? Describe a scenario where sentiment analysis can provide valuable insights.**  
 * It is the process of determining the emotional tone behind a series of words, used to understand the attitudes, opinions, and emotions expressed within a text.

 * *Scenario:*  
 A company launching a new product can use sentiment analysis on social media posts and customer reviews to gauge public reaction. Positive sentiments can affirm marketing strategies, while negative sentiments can highlight areas for improvement, leading to better customer satisfaction and product refinement.

**5. Explain how sentiment analysis can be performed using machine learning techniques.**  
* Sentiment analysis using machine learning techniques involves several steps, from data collection and pre-processing to model training, evaluation, and prediction.   

***Steps to Perform Sentiment Analysis Using Machine Learning***  
* *Data Collection:*
  * Gather a dataset containing text data with sentiment labels.

* *Data Pre-processing:*
  * *Text Cleaning:* Remove noise such as special characters and URLs.
  * *Text Normalization:* Convert all text to lowercase to ensure uniformity.
  * *Tokenization:* Split the text into individual words or tokens.
  * *Stop-word Removal:* Remove common words (e.g., "and", "the") that do not carry significant meaning.
  * *Stemming/Lemmatization:* Reduce words to their root form to standardize them.

  Example: Convert "The movie was fantastic!" to ["movie", "fantastic"].
* *Feature Extraction:*
  Convert text data into numerical representations that can be fed into a machine learning model.
  * *Bag of Words (BoW):* Represent text as a vector of word counts.
  * *TF-IDF (Term Frequency-Inverse Document Frequency):* Weight each word's frequency in a document by how rare the word is across the corpus, so that distinctive words count for more.
  * *Word Embeddings:* Use pre-trained embeddings like Word2Vec or GloVe, or contextual embeddings like BERT, to capture semantic meaning.

  Example: "The movie was fantastic" -> [0.1, 0.2, ..., 0.5] (vector representation).
* *Model Selection:*
  Choose an appropriate machine learning algorithm, for example:
  * **Naive Bayes:** assumes independence between features and works well for text classification.
  * **Support Vector Machines (SVM):** effective for high-dimensional spaces and text classification.
  * **Logistic Regression:** a simple and efficient baseline for binary classification tasks.

* *Training the Model:*
  * Split the dataset into training and validation sets.
  * Train the selected model on the training data.

* *Model Evaluation:*
  * Evaluate the model's performance using metrics like accuracy, precision, recall, and F1-score.
  * Use a validation set to tune hyperparameters and avoid overfitting.

* *Prediction:*
  * Apply the trained model to new, unseen text data to predict sentiment.

  Example: Predict the sentiment of new movie reviews using the trained model.

* *Post-processing and Visualization:*
  * Analyze the results and visualize the sentiment distribution, trends, and patterns.
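
The steps above can be sketched end to end with a from-scratch multinomial Naive Bayes classifier. This is a minimal illustration under stated assumptions: the six training sentences, the two test sentences, and the mini stop-word list are all invented, and in practice one would normally use a library implementation such as scikit-learn's `MultinomialNB` on a much larger labeled dataset.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy data, invented for illustration
TRAIN = [
    ("the movie was fantastic", "pos"),
    ("great acting and a great story", "pos"),
    ("i loved this film", "pos"),
    ("the movie was terrible", "neg"),
    ("boring story and bad acting", "neg"),
    ("i hated this film", "neg"),
]
TEST = [("fantastic story", "pos"), ("terrible and boring", "neg")]
STOP_WORDS = {"the", "was", "and", "a", "i", "this"}  # assumed mini list

def preprocess(text):
    """Pre-processing: lowercase, tokenize on letters, drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]

def train(examples):
    """Training: bag-of-words counts per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(preprocess(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict(text, model):
    """Prediction: class with the highest log-posterior, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in preprocess(text):
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
predictions = [predict(text, model) for text, _ in TEST]
# Evaluation: fraction of held-out examples classified correctly
accuracy = sum(p == y for p, (_, y) in zip(predictions, TEST)) / len(TEST)
print(predictions, accuracy)  # ['pos', 'neg'] 1.0
```

A perfect score on two hand-picked sentences says nothing about real-world performance; it only shows how the stages connect.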







6. **Write Python code to preprocess the text data provided. Submit a comprehensive Jupyter notebook.**

In [1]:
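# Import necessary libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Download the NLTK resources used below.
# Newer NLTK releases look for 'punkt_tab' instead of 'punkt', so both are requested.
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)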



# Sample text data from the provided document
text_data = """
Once upon a time, in a small village, there lived a QUICK brown fox named Felix. He was
famous for his speed and agility. 🦊 Felix could run at 25 miles per hour (mph) and jump over
obstacles twice his height! 🌟
"""

# 1. Text Normalization
text_data = text_data.lower()

# 2. Tokenization
tokens = word_tokenize(text_data)

# 3. Stop-word Removal (the isalnum() check also drops punctuation and emoji tokens)
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

# 4. Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

# 5. Lemmatization (without a pos argument, WordNetLemmatizer treats every word as a noun)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

# Print the results
print(" Normalized Text:", text_data)
print("\n Tokens:\n", tokens)
print("\n Filtered Tokens:\n", filtered_tokens)

print("\nStemmed Tokens:\n", stemmed_tokens)
print("\nLemmatized Tokens:\n", lemmatized_tokens)

 Normalized Text: 
once upon a time, in a small village, there lived a quick brown fox named felix. he was
famous for his speed and agility. 🦊 felix could run at 25 miles per hour (mph) and jump over
obstacles twice his height! 🌟


 Tokens:
 ['once', 'upon', 'a', 'time', ',', 'in', 'a', 'small', 'village', ',', 'there', 'lived', 'a', 'quick', 'brown', 'fox', 'named', 'felix', '.', 'he', 'was', 'famous', 'for', 'his', 'speed', 'and', 'agility', '.', '🦊', 'felix', 'could', 'run', 'at', '25', 'miles', 'per', 'hour', '(', 'mph', ')', 'and', 'jump', 'over', 'obstacles', 'twice', 'his', 'height', '!', '🌟']

 Filtered Tokens:
 ['upon', 'time', 'small', 'village', 'lived', 'quick', 'brown', 'fox', 'named', 'felix', 'famous', 'speed', 'agility', 'felix', 'could', 'run', '25', 'miles', 'per', 'hour', 'mph', 'jump', 'obstacles', 'twice', 'height']

Stemmed Tokens:
 ['upon', 'time', 'small', 'villag', 'live', 'quick', 'brown', 'fox', 'name', 'felix', 'famou', 'speed', 'agil', 'felix', 'could', 'ru