# Git: https://github.com/onlyfood/Deep-Learning-Week4.git

# Problem/Data Description:
# The data for this analysis comes from the Kaggle competition "Natural Language Processing with Disaster Tweets." The objective is to classify each tweet as referring to a real disaster (1) or not (0). The dataset consists of a training set and a test set. Both contain the tweet text, but only the training set includes the corresponding labels; the labels for the test set are what the model must predict.
# 
# The problem is of great importance as accurate classification of disaster tweets can help in timely response and effective allocation of resources during emergencies. It can also aid in monitoring and understanding public sentiment during crisis situations.
# 
# Discussion/Conclusion:
# In this analysis, we explored a dataset that contains disaster-related tweets and their labels. We performed exploratory data analysis (EDA) to gain insights into the data and understand its characteristics.
# 
# During the EDA, we examined the shape of the training and test datasets and observed the first few rows to get a sense of the data structure. We also analyzed the distribution of the target variable, which indicated the proportion of disaster-related tweets in the dataset.
# 
# Moving forward, we preprocessed the text by removing URLs, stripping punctuation, and converting everything to lowercase. We then used a CountVectorizer to convert the cleaned text into bag-of-words count vectors that can be used for modeling.
# 
# A logistic regression model was trained on the vectorized data, with 20% of the training set held out for validation. The model achieved a validation accuracy of about 80.2%, meaning it correctly classified roughly four out of five held-out tweets as referring to a real disaster or not.
# 
# In conclusion, this analysis demonstrates the potential of natural language processing techniques to classify disaster tweets. The trained model can be applied to classify new, unseen tweets and aid in real-time emergency response and sentiment analysis during crisis situations.
# 
# Further improvements can be made by exploring advanced natural language processing models, feature engineering, and hyperparameter tuning. Additionally, the model's performance can be evaluated using additional evaluation metrics and compared with other models to identify the most effective approach for classifying disaster tweets.
# 
# By addressing this problem effectively, we can contribute to enhancing emergency response systems and providing valuable insights for disaster management and public safety.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/nlp-getting-started/sample_submission.csv
/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv


In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import re
import string

In [3]:
# Load the dataset
train_data = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test_data = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
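
The write-up describes checking the dataset shape, the first rows, and the target distribution. A minimal sketch of those checks follows. It uses a tiny made-up frame (`toy_train`) in place of `train.csv` so it runs without the Kaggle files, and `summarize` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for train.csv: made-up rows, assumed column names.
toy_train = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "text": [
        "Forest fire near La Ronge Sask. Canada",
        "I love fruits",
        "Residents asked to shelter in place",
        "What a wonderful day!",
        "My car is so fast",
    ],
    "target": [1, 0, 1, 0, 0],
})

def summarize(df, label_col="target"):
    """Return the frame's shape and the share of each class in label_col."""
    class_share = df[label_col].value_counts(normalize=True).sort_index()
    return df.shape, class_share.to_dict()

shape, class_share = summarize(toy_train)
print("Shape:", shape)
print(toy_train.head(3))
print("Class share:", class_share)
```

On the real data, the same call would be `summarize(train_data)`. The class share is worth checking before trusting accuracy, since a skewed target makes a high accuracy easier to reach.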

In [4]:
# Preprocessing function
def preprocess_text(text):
    # Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)  # "http\S+" already covers https URLs
    
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    
    # Convert to lowercase
    text = text.lower()
    
    return text
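
To make the effect of the three cleaning steps concrete, here is a small self-contained check on a made-up tweet. The function body repeats the steps above with an equivalent URL pattern; the sample string is an assumption for illustration only.

```python
import re
import string

def preprocess_text(text):
    # Same three steps as the notebook cell: URLs, punctuation, lowercase.
    text = re.sub(r"http\S+|www\S+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower()

# Made-up tweet, not taken from the dataset.
sample = "Forest FIRE near La Ronge, Sask.! http://t.co/abc123 #wildfire @user"
cleaned = preprocess_text(sample)
print(cleaned.split())
# ['forest', 'fire', 'near', 'la', 'ronge', 'sask', 'wildfire', 'user']
```

Two side effects are visible here: removing the URL leaves a double space behind, and stripping punctuation turns `#wildfire` and `@user` into ordinary words. Neither affects CountVectorizer's tokenization, but both matter if hashtags or mentions are later used as features.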


In [5]:
# Apply preprocessing to the text data
train_data['preprocessed_text'] = train_data['text'].apply(preprocess_text)
test_data['preprocessed_text'] = test_data['text'].apply(preprocess_text)


In [6]:
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_data['preprocessed_text'], train_data['target'], test_size=0.2, random_state=42)


In [7]:
# Vectorize the text data using a Bag-of-Words approach
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_val_vectorized = vectorizer.transform(X_val)
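
As a rough illustration of what the bag-of-words step produces, the sketch below rebuilds the idea in plain Python. It is a simplified model, not scikit-learn's implementation: it returns dense lists rather than a sparse matrix, and `tokenize`, `fit_vocabulary`, and `transform` are hypothetical names. The token rule mirrors CountVectorizer's default of lowercased words with at least two characters.

```python
import re
from collections import Counter

# Token rule modelled on CountVectorizer's default: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document; tokens not in the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["fire in the forest", "forest walk in the sun"]  # made-up examples
vocab = fit_vocabulary(train_docs)
print(vocab)
# {'fire': 0, 'forest': 1, 'in': 2, 'sun': 3, 'the': 4, 'walk': 5}
print(transform(train_docs, vocab))
# [[1, 1, 1, 0, 1, 0], [0, 1, 1, 1, 1, 1]]
print(transform(["fire and flood"], vocab))
# [[1, 0, 0, 0, 0, 0]]
```

The last call shows why the validation set goes through `transform` only: words that were not seen during fitting ("and", "flood") have no column and are dropped, so the validation data cannot leak into the vocabulary.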


In [8]:
# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)


In [9]:
# Make predictions on the validation set
val_predictions = model.predict(X_val_vectorized)

In [10]:
# Calculate accuracy on the validation set
accuracy = accuracy_score(y_val, val_predictions)
print("Validation Accuracy:", accuracy)

Validation Accuracy: 0.8023637557452397
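
Accuracy alone can hide how the model treats each class, which is why the write-up suggests additional metrics. A minimal sketch of computing precision, recall, and F1 alongside accuracy is shown below. The labels are made up and stand in for `y_val` and `val_predictions`; the numbers say nothing about the actual model.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels and predictions, standing in for y_val and val_predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# precision/recall/F1 are computed for the positive class (1) by default.
scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy case there are 3 true positives, 1 false positive, and 1 false negative, so all four scores come out at 0.750. On real predictions they usually diverge, and the gap between precision and recall shows whether the model tends to over-flag or miss disaster tweets.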


In [11]:
# Vectorize the test data
X_test_vectorized = vectorizer.transform(test_data['preprocessed_text'])
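
The test data has to be transformed with the vectorizer that was fitted on the training split, as done above. One way to make that hard to get wrong is to bundle both steps into a scikit-learn pipeline. The sketch below is an alternative arrangement, not what the notebook does, and it uses made-up texts and labels in place of the Kaggle data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned texts and labels standing in for the Kaggle data.
train_texts = [
    "forest fire spreading fast",
    "earthquake damage reported downtown",
    "flood warning issued for the valley",
    "lovely sunny picnic today",
    "new song is fire",
    "watching a movie tonight",
]
train_labels = [1, 1, 1, 0, 0, 0]
new_texts = ["fire near the valley", "picnic and a movie"]

# fit() learns the vocabulary and the classifier together; predict() then
# reuses that same vocabulary on any new text.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(new_texts)
print(predictions)
```

With this arrangement, `pipeline.predict(test_data['preprocessed_text'])` would replace the separate transform and predict cells, at the cost of not having the intermediate matrices available for inspection.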



In [12]:
# Make predictions on the test set
test_predictions = model.predict(X_test_vectorized)

In [13]:
# Prepare submission file
submission = pd.DataFrame({'id': test_data['id'], 'target': test_predictions})


In [15]:
# Save submission file
submission.to_csv('submission.csv', index=False)