# Day 2: Unsupervised Learning Techniques

***Intro:*** Unsupervised learning is a type of machine learning where the algorithm learns patterns or structures from unlabeled data. 

***The key difference between supervised and unsupervised learning:*** Unlike supervised learning, which requires labeled data with input-output pairs, unsupervised learning algorithms work on data without any predefined output labels or target variables.

***The objective of unsupervised learning*** is to discover inherent patterns, relationships, or structures within the data. It aims to uncover hidden insights, group similar data points together, or identify meaningful representations of the data without any prior knowledge or guidance.

***Common tasks in unsupervised learning include:***

1. Clustering techniques: Grouping similar data points together based on their characteristics or proximity.

2. Dimensionality Reduction: Reducing the number of features or variables in the data while preserving its essential information.

3. Anomaly Detection: Identifying unusual or rare instances that deviate significantly from the norm.

4. Association Rule Mining: Discovering interesting associations or relationships between different variables in the data.

Unsupervised learning algorithms rely on mathematical techniques such as clustering algorithms (e.g., K-means, DBSCAN), dimensionality reduction methods (e.g., PCA, t-SNE), and density estimation techniques (e.g., Gaussian Mixture Models) to learn patterns and extract valuable insights from the data.
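
As a minimal sketch of what "learning without labels" looks like in practice, the cell below runs K-means on a handful of made-up 2-D points. The data and the choice of two clusters are assumptions for illustration only; the algorithm never sees a target column.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up for illustration): two well-separated groups of 2-D points
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # group near (1, 1)
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.7],   # group near (8, 8)
])

# Fit K-means with two clusters; no labels are supplied at any point
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # learned cluster centres
```

The cluster indices themselves are arbitrary; what matters is which points end up sharing an index.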

Unsupervised learning has various applications, including customer segmentation, recommendation systems, anomaly detection, image and text analysis, and data exploration. It plays a crucial role in exploratory data analysis and can provide valuable insights when dealing with large, unstructured, or unlabeled datasets.

## Unsupervised Learning Example
#### Project: Clustering Analysis of COVID-19 Tweets

Description:
You have been provided with a dataset containing a collection of tweets related to the COVID-19 pandemic. The objective of this project is to apply unsupervised learning techniques to cluster the tweets based on their content and identify common themes or topics discussed on Twitter during the pandemic.



Tasks:
1. Data Preprocessing:
   - Load the COVID-19 Twitter dataset.
   - Text Cleaning: Removal of special characters, URLs, and unnecessary white spaces.
   - Tokenization: Splitting the tweet text into individual words or tokens.
   - Stopword Removal: Removing common words that do not carry significant meaning.
   - Stemming or Lemmatization: Reducing words to their base form (e.g., running -> run) for normalization.

2. Feature Extraction:
   
   Convert the preprocessed text data into numerical representations suitable for clustering.
   - Utilize techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (e.g., Word2Vec, GloVe) to represent the tweets as feature vectors.

3. Unsupervised Learning Techniques:
  - Clustering: Grouping similar tweets together based on their content or topics.
  - Topic Modeling: Extracting latent topics from the tweets using techniques like Latent Dirichlet Allocation (LDA).
  - Word Embeddings: Representing words or phrases as dense vectors to capture semantic relationships.
  - Dimensionality Reduction: Reducing the dimensionality of the dataset using techniques like Principal Component Analysis (PCA) or t-SNE.

4. Cluster Analysis and Visualization:
   - Analyze the content and themes of each cluster to gain insights into the different topics discussed in COVID-19 tweets.
   - Use visualization techniques (e.g., word clouds, bar charts) to visualize the most frequent words or phrases within each cluster.
   - Identify and label the clusters based on the dominant themes present in the tweets.

5. Interpretation and Insights:
   - Interpret the clustering results and identify the major topics or discussions surrounding COVID-19 on Twitter.
   - Explore the temporal aspects of the tweets to observe any changes in the topics over time.
   - Discuss any interesting findings or patterns discovered through the analysis.

In [None]:
# Dataset source: https://www.kaggle.com/datasets/gpreda/all-covid19-vaccines-tweets

Let's start the project.

1. Data Preprocessing: 

  - Load the COVID-19 Twitter dataset.
     You can download the COVID-19 Twitter dataset from this [GitHub](https://github.com/Sammyjoon/COVID-19_Twitter) repository.

The dataset is a `CSV` file. To work with CSV files in Python you can use the `pandas` package, which provides high-level data structures and data analysis tools and is a popular choice for structured data.

In [1]:
#import pandas
import pandas as pd


- Read the data from the `CSV` file into a `pandas` DataFrame. A DataFrame is a two-dimensional tabular data structure that provides easy indexing, slicing, and manipulation of data.

In [None]:
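# Replace the placeholder string with the path or URL of the CSV file
df = pd.read_csv('specify the file path or URL as the argument')
# Show only the first few rows instead of printing the whole table
print(df.head())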


Accessing data in a DataFrame:


In [None]:
# 'Name' is a placeholder; use a column name that exists in your DataFrame
name_column = df['Name']  # Column-wise access
first_row = df.loc[0]  # Access row by label (index)
second_row = df.iloc[1]  # Access row by integer location
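
The access patterns above can be tried without downloading anything. The sketch below reads a tiny in-memory CSV; the column names `user_name`, `date`, and `text` are hypothetical stand-ins and may not match the real dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file
csv_text = """user_name,date,text
alice,2021-01-05,Got my first vaccine dose today!
bob,2021-01-06,Wearing a mask on the train.
carol,2021-02-01,Second dose done.
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(sample_df.shape)             # (rows, columns)
print(sample_df.columns.tolist())  # column names

text_column = sample_df["text"]    # column-wise access, returns a Series
first_row = sample_df.loc[0]       # row by index label
second_row = sample_df.iloc[1]     # row by integer position
recent = sample_df[sample_df["date"] >= "2021-02-01"]  # boolean filtering
print(recent)
```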

- Text Cleaning

   To demonstrate common text cleaning techniques, we use the `nltk` library in Python. NLTK provides functionality and resources for tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and more.

In [None]:
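import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources used below (only needed once per environment)
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this instead of 'punkt'
nltk.download('stopwords')  # stop-word lists
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer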

# Text to be cleaned
text = "This is an example sentence! It contains punctuation marks and stopwords."

# Lowercasing
text = text.lower()

# Tokenization
tokens = word_tokenize(text)

# Removing Punctuation
tokens = [token for token in tokens if token not in string.punctuation]

# Removing Stop Words
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(token) for token in tokens]

# Print the cleaned tokens
print(tokens)
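
The task list also calls for removing URLs, special characters, and extra white space, which tweets contain far more often than the example sentence above. One possible sketch using only the standard `re` module follows; the function name `clean_tweet` and the exact patterns are illustrative choices, not a fixed recipe.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and special characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                    # @mentions
    text = re.sub(r"#", "", text)                        # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remaining special characters
    return re.sub(r"\s+", " ", text).strip()             # collapse white space

sample = "Got my #vaccine today!! Thanks @NHS  https://example.com/abc"
print(clean_tweet(sample))
```

The cleaned string can then be passed to `word_tokenize` and the stop-word and lemmatization steps shown earlier.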


2. Feature Extraction:
     
     Feature extraction is a process in machine learning and data analysis that involves transforming raw data into a set of meaningful and informative features. It aims to capture the most relevant information from the data and represent it in a way that is suitable for the learning algorithms.
   - Utilize techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (e.g., Word2Vec, GloVe) to represent the tweets as feature vectors.
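
For reference, one common formulation of TF-IDF is written out below. It follows the default settings of scikit-learn's `TfidfVectorizer` as I understand them (smoothed IDF, then L2 normalization of each document vector); other libraries and textbooks use slightly different variants. Here $n$ is the number of documents, $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents containing $t$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Each document vector is then divided by its Euclidean norm, so a term that appears in nearly every document gets a weight close to its raw count, while a rare term is weighted up.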



For feature extraction we use the `Scikit-learn` library. `Scikit-learn` is a popular machine learning library in Python that provides a wide range of tools for tasks such as data preprocessing, feature selection, model training, and evaluation.

To use `Scikit-learn` we need to install it first.

In [2]:
!pip install scikit-learn




The following cell performs feature extraction with the `TF-IDF` technique using the `Scikit-learn` library:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the corpus and transform the corpus into TF-IDF features
X = vectorizer.fit_transform(corpus)

# Get the feature names (words); get_feature_names() was removed in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF features
for i, doc in enumerate(corpus):
    print(f"Document {i+1}:")
    for j, word in enumerate(feature_names):
        print(f"{word}: {X[i, j]}")
    print()


3. Unsupervised Learning Techniques:
  - Clustering: Grouping similar tweets together based on their content or topics.
  - Topic Modeling: Extracting latent topics from the tweets using techniques like Latent Dirichlet Allocation (LDA).
  - Word Embeddings: Representing words or phrases as dense vectors to capture semantic relationships.
  - Dimensionality Reduction: Reducing the dimensionality of the dataset using techniques like Principal Component Analysis (PCA) or t-SNE.
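
The techniques listed above could be sketched with scikit-learn as follows. The six tweets are made up, the number of clusters and topics (2) is an arbitrary assumption, and `TruncatedSVD` is used in place of PCA because it works directly on the sparse TF-IDF matrix. Word embeddings are left out because they need an additional library.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up, already-cleaned tweets used only for illustration
tweets = [
    "got my first vaccine dose today",
    "second vaccine dose booked for next week",
    "vaccine appointment done feeling fine",
    "schools closed again because of lockdown",
    "lockdown extended shops closed for a month",
    "working from home during lockdown",
]

# Clustering: TF-IDF vectors grouped with K-means (k=2 is an arbitrary choice)
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(tweets)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_tfidf)
print("Cluster labels:", cluster_labels)

# Topic modeling: LDA is fitted on raw term counts rather than TF-IDF weights
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(X_counts)  # one row of topic proportions per tweet
print("Topic mixture of first tweet:", doc_topics[0])

# Dimensionality reduction: project the TF-IDF vectors to 2-D for plotting
svd = TruncatedSVD(n_components=2, random_state=42)
X_2d = svd.fit_transform(X_tfidf)
print("Reduced shape:", X_2d.shape)
```

On a real dataset the number of clusters and topics would need to be chosen by inspection or with a criterion such as the silhouette score.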



4. Cluster Analysis and Visualization:
   - Analyze the content and themes of each cluster to gain insights into the different topics discussed in COVID-19 tweets.
   - Use visualization techniques (e.g., word clouds, bar charts) to visualize the most frequent words or phrases within each cluster.
   - Identify and label the clusters based on the dominant themes present in the tweets.
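
One simple way to inspect clusters is to count the most frequent words in each. The sketch below uses only the standard library; the tokenized tweets and their cluster labels are hypothetical inputs standing in for the output of the earlier steps.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: tokenized tweets and the cluster label assigned to each
tokenized_tweets = [
    ["vaccine", "dose", "today"],
    ["vaccine", "appointment", "booked"],
    ["lockdown", "school", "closed"],
    ["lockdown", "shop", "closed"],
]
cluster_labels = [0, 0, 1, 1]

# Count word frequencies separately for each cluster
word_counts = defaultdict(Counter)
for tokens, label in zip(tokenized_tweets, cluster_labels):
    word_counts[label].update(tokens)

# The top words per cluster are the raw material for word clouds or bar charts
top_words = {label: counter.most_common(2) for label, counter in word_counts.items()}
for label, words in sorted(top_words.items()):
    print(f"Cluster {label}: {words}")
```

The resulting word lists suggest a label for each cluster, for example "vaccination" for cluster 0 and "lockdown" for cluster 1 in this toy data.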



5. Interpretation and Insights:
   - Interpret the clustering results and identify the major topics or discussions surrounding COVID-19 on Twitter.
   - Explore the temporal aspects of the tweets to observe any changes in the topics over time.
   - Discuss any interesting findings or patterns discovered through the analysis.
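
To look at how topics shift over time, the cluster labels can be cross-tabulated against the tweet date. The sketch below assumes a DataFrame with a `date` column and a `cluster` column; both column names and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data: one row per tweet with its date and assigned cluster
tweets_df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-03", "2021-01-15", "2021-01-20",
        "2021-02-02", "2021-02-10", "2021-02-25",
    ]),
    "cluster": [1, 1, 0, 0, 0, 1],
})

# Count tweets per cluster for each calendar month
tweets_df["month"] = tweets_df["date"].dt.strftime("%Y-%m")
monthly_counts = pd.crosstab(tweets_df["month"], tweets_df["cluster"])

# Convert counts to each cluster's share of that month's tweets
monthly_share = monthly_counts.div(monthly_counts.sum(axis=1), axis=0)

print(monthly_counts)
print(monthly_share.round(2))
```

A rising or falling share for a cluster across months is a starting point for discussing how the conversation changed.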