1. Obtain the named entity relations in the document

In [1]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = """Microsoft, founded by Bill Gates and Paul Allen, is headquartered in Redmond, Washington."""

# Process the text
doc = nlp(text)

# Named Entity Recognition
for ent in doc.ents:
    print(ent.text, ent.label_)

# To extract relations, one basic method is to look for patterns or dependencies between entities
relations = []
for ent1 in doc.ents:
    for ent2 in doc.ents:
        if ent1 != ent2:  # Skip pairing an entity with itself
            # Simplistic: every ordered pair of entities is recorded as a "relation".
            # Real-world usage requires more complex logic (patterns, dependency paths).
            relations.append((ent1.text, ent2.text))

print(relations)


Microsoft ORG
Bill Gates PERSON
Paul Allen PERSON
Redmond GPE
Washington GPE
[('Microsoft', 'Bill Gates'), ('Microsoft', 'Paul Allen'), ('Microsoft', 'Redmond'), ('Microsoft', 'Washington'), ('Bill Gates', 'Microsoft'), ('Bill Gates', 'Paul Allen'), ('Bill Gates', 'Redmond'), ('Bill Gates', 'Washington'), ('Paul Allen', 'Microsoft'), ('Paul Allen', 'Bill Gates'), ('Paul Allen', 'Redmond'), ('Paul Allen', 'Washington'), ('Redmond', 'Microsoft'), ('Redmond', 'Bill Gates'), ('Redmond', 'Paul Allen'), ('Redmond', 'Washington'), ('Washington', 'Microsoft'), ('Washington', 'Bill Gates'), ('Washington', 'Paul Allen'), ('Washington', 'Redmond')]
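
The all-pairs loop above records every ordered pair, 20 in total for five entities. One possible refinement, sketched below, is to keep only pairs whose entity labels match a small schema of plausible relation types. The schema, the relation names (`affiliated_with`, `located_in`), and the helper `candidate_relations` are assumptions made for illustration; the entity list is copied from the NER output above so the sketch runs without loading a spaCy model.

```python
from itertools import permutations

# Entities and labels copied from the NER output above
entities = [
    ("Microsoft", "ORG"),
    ("Bill Gates", "PERSON"),
    ("Paul Allen", "PERSON"),
    ("Redmond", "GPE"),
    ("Washington", "GPE"),
]

# Hypothetical schema: (subject label, object label) -> candidate relation name
RELATION_SCHEMA = {
    ("PERSON", "ORG"): "affiliated_with",
    ("ORG", "GPE"): "located_in",
}

def candidate_relations(entities, schema):
    """Return (subject, relation, object) triples whose label pair appears in the schema."""
    triples = []
    for (text1, label1), (text2, label2) in permutations(entities, 2):
        relation = schema.get((label1, label2))
        if relation is not None:
            triples.append((text1, relation, text2))
    return triples

candidates = candidate_relations(entities, RELATION_SCHEMA)
for triple in candidates:
    print(triple)
```

Filtering by label pair only narrows the candidates; it does not confirm that the relation actually holds in the sentence, which would still need pattern or dependency checks.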


2. Tokenization

In [15]:
import nltk
from nltk.tokenize import sent_tokenize

# Load the NLTK tokenizer
nltk.download('punkt')

# Sample news article text
news_article_text = """
Apple Inc. reported strong quarterly results on Tuesday, bolstered by strong sales of iPhones, which the company said were up 17% from a year earlier. The tech giant posted revenue of $123.9 billion for the quarter ending December 31, beating analyst expectations. CEO Tim Cook attributed the strong performance to robust demand for the iPhone 13 lineup and continued growth in the company's services segment.

"We're thrilled with our results for the quarter, which set new records for revenue, earnings, and iPhone sales," said Cook during the earnings call with analysts. "Our customers continue to love the iPhone 13 lineup, and we're seeing strong demand across our product categories."

Apple's services segment, which includes revenue from the App Store, iCloud, and Apple Music, also saw significant growth, with revenue increasing by 15% year-over-year. The company reported double-digit growth in each of its geographic segments, with particularly strong performance in China.

Despite the positive results, Apple's stock fell slightly in after-hours trading following the earnings announcement. Analysts cited concerns about supply chain constraints and the impact of geopolitical tensions on the company's business in China as potential reasons for the dip.

Looking ahead, Apple provided guidance for the current quarter, forecasting revenue between $107 billion and $110 billion. The company said it expects continued strong demand for its products and services, despite ongoing challenges related to the global supply chain.

In addition to its financial results, Apple announced several new initiatives during the earnings call, including plans to expand its renewable energy projects and investments in workforce development programs. The company also highlighted its commitment to privacy and security, emphasizing the importance of protecting user data in the digital age.

Overall, Apple's strong performance in the latest quarter reflects the company's resilience amid challenging market conditions. With continued innovation and strategic investments, Apple remains well-positioned for future growth and success.
"""

def get_top_sentences(news_text, top_n=10):
    """Return the top_n longest sentences of news_text, ranked by word count."""
    # Tokenize the text into sentences
    sentences = sent_tokenize(news_text)

    # Calculate the length of each sentence
    sentence_lengths = {sentence: len(sentence.split()) for sentence in sentences}

    # Rank sentences based on their lengths
    ranked_sentences = sorted(sentence_lengths.items(), key=lambda x: x[1], reverse=True)

    # Get the top N sentences
    top_sentences = [sentence[0] for sentence in ranked_sentences[:top_n]]

    return top_sentences

# Get the top sentences from the news article text
top_sentences = get_top_sentences(news_article_text)

# Output the top sentences
for i, sentence in enumerate(top_sentences, 1):
    print(f"{i}. {sentence}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


1. In addition to its financial results, Apple announced several new initiatives during the earnings call, including plans to expand its renewable energy projects and investments in workforce development programs.
2. "We're thrilled with our results for the quarter, which set new records for revenue, earnings, and iPhone sales," said Cook during the earnings call with analysts.
3. 
Apple Inc. reported strong quarterly results on Tuesday, bolstered by strong sales of iPhones, which the company said were up 17% from a year earlier.
4. Analysts cited concerns about supply chain constraints and the impact of geopolitical tensions on the company's business in China as potential reasons for the dip.
5. Apple's services segment, which includes revenue from the App Store, iCloud, and Apple Music, also saw significant growth, with revenue increasing by 15% year-over-year.
6. CEO Tim Cook attributed the strong performance to robust demand for the iPhone 13 lineup and continued growth in the comp
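
Sentence length is a crude proxy for importance. A possible alternative, sketched below, scores each sentence by the average corpus frequency of its words. This is a sketch under stated assumptions: the helper names `split_sentences` and `score_sentences` are hypothetical, the sample text is made up, and the regex-based splitter is a naive stand-in for `sent_tokenize` (it would break on abbreviations such as "Inc.") used only to keep the example free of NLTK data downloads.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace (not Punkt)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def score_sentences(text):
    """Map each sentence to the mean frequency of its words across the whole text."""
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = {}
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        scores[sentence] = sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0
    return scores

# Made-up sample text for illustration
sample = "Apple reported strong results. Apple sales were strong. Analysts were cautious."
ranked = sorted(score_sentences(sample).items(), key=lambda kv: kv[1], reverse=True)
for sentence, score in ranked:
    print(f"{score:.2f}  {sentence}")
```

Averaging rather than summing keeps long sentences from winning on length alone.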

3. Implement a text classification application


In [9]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load the 20 Newsgroups dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
data = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Convert text data into numerical feature vectors using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Support Vector Machine (SVM) classifier
classifier = SVC(kernel='linear', random_state=42)
classifier.fit(X_train_tfidf, y_train)

# Predict the labels of test data
y_pred = classifier.predict(X_test_tfidf)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Accuracy: 0.9680851063829787

Classification Report:
                        precision    recall  f1-score   support

           alt.atheism       0.98      0.95      0.97       175
         comp.graphics       0.95      1.00      0.98       200
               sci.med       0.97      0.96      0.96       200
soc.religion.christian       0.97      0.95      0.96       177

              accuracy                           0.97       752
             macro avg       0.97      0.97      0.97       752
          weighted avg       0.97      0.97      0.97       752
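
The TF-IDF step can be made concrete with a small pure-Python sketch. It assumes whitespace tokenization, no stop-word removal, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which mirrors scikit-learn's default settings as far as the weighting goes. The toy documents and the function name `tfidf` are made up for illustration; `TfidfVectorizer` itself tokenizes differently and returns a sparse matrix.

```python
import math
from collections import Counter

# Made-up toy documents for illustration
docs = [
    "the scan shows a tumor",
    "the image shows a polygon",
    "render the polygon mesh",
]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Terms that occur in every document (here "the") receive the lowest weight, while terms unique to one document (here "tumor") receive the highest, which is what lets a linear classifier separate topics.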



4. Implement text clustering


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Sample data
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Convert text data to numerical vectors
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Clustering
true_k = 2  # Number of clusters
kmeans = KMeans(n_clusters=true_k)
kmeans.fit(X)

# Print cluster centers and associated words
print("Cluster centers:")
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

# Predicting the cluster for new samples
new_samples = ["This is a new document."]
X_new = vectorizer.transform(new_samples)
predicted = kmeans.predict(X_new)
print("New Sample belongs to Cluster:", predicted)


Cluster centers:
Cluster 0:
 document
 second

Cluster 1:
 second
 document

New Sample belongs to Cluster: [0]
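
To show what the clustering step does conceptually, here is a minimal pure-Python sketch of the assign-and-update loop that k-means is built on. It is not scikit-learn's implementation: the initial centroids are supplied by the caller (an assumption made for reproducibility, whereas `KMeans` chooses its own), the 2-D points are made up, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, centroids, n_iter=10):
    """Run the assign/update loop from the given initial centroids."""
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            new_centroids.append(mean(members) if members else centroids[k])
        if new_centroids == centroids:
            break  # Converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Made-up 2-D points standing in for TF-IDF vectors
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = kmeans(points, centroids=[[1.0, 0.0], [0.0, 1.0]])
print(labels)
print(centroids)
```

Because the result depends on the starting centroids, cluster numbering in the scikit-learn cell above can differ between runs unless a `random_state` is fixed.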




5. Implement a web crawler

In [10]:
import requests
from bs4 import BeautifulSoup
import time

def extract_links(url):
    # Send an HTTP request to the URL (time out rather than hang on an unresponsive server)
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an exception for 4xx/5xx status codes
    except requests.exceptions.RequestException as e:
        print("Error:", e)
        return []

    # Parse the HTML content of the webpage
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all anchor tags (links) in the HTML content
    links = soup.find_all('a')

    # Extract the href attribute from each anchor tag
    extracted_links = [link.get('href') for link in links if link.get('href') is not None]

    return extracted_links

def crawl_web(start_url, depth):
    visited_urls = set()
    urls_to_visit = set([start_url])
    current_depth = 0

    while current_depth <= depth:
        next_urls_to_visit = set()

        for url in urls_to_visit:
            if url not in visited_urls:
                print("Crawling:", url)
                visited_urls.add(url)

                # Extract links from the current URL
                links = extract_links(url)

                # Pause for 1 second between requests to avoid overwhelming the server
                time.sleep(1)

                # Add new absolute links to the set of URLs to visit next
                for link in links:
                    if link.startswith('http'):
                        next_urls_to_visit.add(link)

        # Move to the next depth level
        urls_to_visit = next_urls_to_visit
        current_depth += 1

# Example usage
start_url = 'https://example.com'  # Starting URL
depth = 2  # Maximum depth to crawl
crawl_web(start_url, depth)

Crawling: https://example.com
Crawling: https://www.iana.org/domains/example
Crawling: https://www.icann.org/privacy/tos
Crawling: http://www.icann.org/
Crawling: http://pti.icann.org
Error: HTTPConnectionPool(host='pti.icann.org', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7871cacecb20>: Failed to establish a new connection: [Errno 111] Connection refused'))
Crawling: https://www.icann.org/privacy/policy
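
The crawler above keeps only hrefs that already start with `http`, so relative links such as `/about` are skipped. A possible extension, sketched below with the standard library only, resolves each href against the page URL, drops fragments and non-HTTP schemes, and removes duplicates. The helper name `normalize_links` and the sample hrefs are assumptions for illustration, not part of the crawler above.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments and non-HTTP(S) links, de-duplicate in order."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin handles relative paths; urldefrag strips any '#fragment'
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # Skip mailto:, javascript:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

# Made-up hrefs as they might appear on a page
links = normalize_links(
    "https://example.com/docs/index.html",
    [
        "/about",
        "guide.html#intro",
        "mailto:admin@example.com",
        "https://www.iana.org/domains/example",
        "guide.html",
    ],
)
for link in links:
    print(link)
```

Passing the result of `extract_links(url)` through such a helper before the `startswith('http')` check would let the crawler follow same-site relative links as well.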


6. Write an application for social media mining


Creating a social media mining application involves collecting, processing, and analyzing data from various social media platforms. Here's a basic outline for building such an application:

Select Social Media Platforms: Choose the social media platforms from which you want to gather data, such as Twitter, Facebook, Instagram, LinkedIn, or Reddit.

Set Up API Access: Register for developer accounts and obtain API access tokens for the selected social media platforms. APIs allow you to programmatically access data from these platforms.

Define Data Sources: Determine the specific types of data you want to collect, such as posts, comments, likes, followers, etc. Define search queries or filters to retrieve relevant data.

Collect Data: Use the APIs provided by social media platforms to collect data based on your defined sources and criteria. Store the collected data in a database or file system for further processing.

Preprocess Data: Clean and preprocess the collected data by removing noise, filtering out irrelevant information, and transforming it into a format suitable for analysis.

Analyze Data: Apply various data mining and analysis techniques to gain insights from the collected data. This may include sentiment analysis, topic modeling, trend detection, network analysis, etc.

Visualize Results: Visualize the analyzed data using charts, graphs, word clouds, network diagrams, etc., to make the insights more understandable and actionable.

Implement Features: Depending on your application's objectives, implement features such as real-time monitoring, keyword tracking, user profiling, etc., to provide valuable functionalities to users.

Deploy the Application: Deploy your social media mining application on a server or cloud platform so that users can access and use it. Ensure scalability, reliability, and security of the deployed application.

Monitor and Maintain: Regularly monitor the performance of your application, address any issues or bugs that arise, and update the application with new features or improvements as needed.
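
The preprocessing and analysis steps of the outline above can be sketched without any platform API. The example below is a minimal sketch under stated assumptions: the posts are hypothetical stand-ins for collected data, the helper names `preprocess` and `hashtag_counts` are made up, and hashtag counting stands in for the richer analyses (sentiment, topics, networks) mentioned in the outline.

```python
import re
from collections import Counter

# Hypothetical posts standing in for data collected through a platform API
posts = [
    "Loving the new phone! #tech #gadgets https://example.com/review",
    "@alice have you seen this? #Tech is moving fast",
    "Battery life is terrible... #gadgets",
]

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def preprocess(post):
    """Strip URLs and @mentions, lowercase, and collapse whitespace."""
    text = URL_RE.sub("", post)
    text = MENTION_RE.sub("", text)
    return " ".join(text.lower().split())

def hashtag_counts(posts):
    """Count hashtags case-insensitively across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(post))
    return counts

cleaned = [preprocess(p) for p in posts]
trending = hashtag_counts(posts).most_common(2)

for text in cleaned:
    print(text)
print(trending)
```

In a full application the `posts` list would be replaced by records fetched in the collection step and stored in a database, and the counts would feed the visualization step.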

7. Decision tree

In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

# Define the dataset
data = {
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}

# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(data)

# Convert categorical variables to numerical labels
label_encoders = {}
for column in df.columns:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])

# Split features and target variable
X = df.drop('PlayTennis', axis=1)
y = df['PlayTennis']

# Train Decision Tree Classifier
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Define new instance to predict
new_data = {
    'Outlook': ['Sunny'],
    'Temperature': ['Hot'],
    'Humidity': ['High'],
    'Wind': ['Weak']
}
new_df = pd.DataFrame(new_data)

# Transform new instance with label encoders
for column in new_df.columns:
    new_df[column] = label_encoders[column].transform(new_df[column])

# Predict
prediction = clf.predict(new_df)
print("Prediction:", label_encoders['PlayTennis'].inverse_transform(prediction))

Prediction: ['No']
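
To illustrate how a decision tree chooses a split, the sketch below computes the entropy of the PlayTennis labels and the information gain of the Outlook feature, using the same values as the dataset above. This is the ID3-style calculation, shown as an assumption for teaching purposes: `DecisionTreeClassifier` defaults to the Gini criterion and makes binary splits on the label-encoded values, so it does not compute exactly these numbers. The function names are hypothetical.

```python
import math
from collections import Counter

# Outlook and PlayTennis columns copied from the dataset above
outlook = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting labels on each distinct feature value."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [label for f, label in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

base = entropy(play)
gain = information_gain(outlook, play)
print(f"Entropy(PlayTennis) = {base:.3f}")
print(f"Gain(Outlook)       = {gain:.3f}")
```

With 9 "Yes" and 5 "No" labels the base entropy is about 0.940 bits, and splitting on Outlook removes about 0.247 bits of it, largely because every Overcast day is a "Yes".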
