# **Tutorial: Using AI for Data Cleaning Tasks**


In [1]:
import pandas as pd

file_path = '/content/extracted_tweets.csv'
df = pd.read_csv(file_path)

# Display the first few rows
df.head()


Unnamed: 0,Tweet ID,Tweet Text,Username
0,1887571541086584981,RT @BanKillerRobots: .@Google owner Alphabet's...,aogami_noboru
1,1887571522765594762,Showcase your #AI tool on Popular #AITools .ai...,dilxac
2,1887571520689438969,DJ Qian has a pretty straightforward way of th...,AIDeepfakenews
3,1887571515207463411,Contact me now for hacking or recovery problem...,Jamescyber113
4,1887571509050241523,RT @XYasuke: $SUI Giveaway💧\n\nGiving away 10 ...,Degatyu


**Prompt**: describe this dataset and give me key insights

**Response**: The dataset contains 99 entries and 3 columns, with all values present (no missing data). The columns are:

Tweet ID (int64) - A unique numerical identifier for each tweet.
Tweet Text (object) - The content of the tweet.
Username (object) - The Twitter username of the account that posted the tweet.

Example entries from the dataset:

Tweet Text includes retweets, AI-related discussions, hacking/recovery services, and giveaways.
Usernames vary and represent different Twitter users.
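
An AI-generated description is worth checking against the data itself before relying on it. The sketch below shows one way to confirm the row count, column count, dtypes, and missing-value total. It uses a small made-up DataFrame (`sample`, with hypothetical values) in place of the real CSV, so the numbers it prints are not those of the tweets file.

```python
import pandas as pd

# Hypothetical stand-in for the real tweets file; all values are made up.
sample = pd.DataFrame({
    "Tweet ID": [101, 102, 103],
    "Tweet Text": ["RT @user: hello", "Try our #AI tool", "Contact me now"],
    "Username": ["alice", "bob", "carol"],
})

n_rows, n_cols = sample.shape                  # number of entries and columns
dtypes = sample.dtypes.astype(str).to_dict()   # column name -> dtype name
missing = int(sample.isnull().sum().sum())     # total count of missing cells

print(n_rows, n_cols)
print(dtypes)
print(missing)
```

Running the same three checks on `df` gives figures that can be compared directly with the claims in the response.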

**Prompt**: Act like a data analyst who is familiar with all cleaning tasks. Give a step-by-step process to clean this data.

**Response**:

**Step 1: Remove Duplicates**
Check for duplicate tweets based on Tweet ID and Tweet Text.
Drop duplicates if any.

In [2]:
df = df.drop_duplicates(subset=['Tweet ID'], keep='first')
df = df.drop_duplicates(subset=['Tweet Text'], keep='first')  # Optional, based on use case

In [3]:
df

Unnamed: 0,Tweet ID,Tweet Text,Username
0,1887571541086584981,RT @BanKillerRobots: .@Google owner Alphabet's...,aogami_noboru
1,1887571522765594762,Showcase your #AI tool on Popular #AITools .ai...,dilxac
2,1887571520689438969,DJ Qian has a pretty straightforward way of th...,AIDeepfakenews
3,1887571515207463411,Contact me now for hacking or recovery problem...,Jamescyber113
4,1887571509050241523,RT @XYasuke: $SUI Giveaway💧\n\nGiving away 10 ...,Degatyu
...,...,...,...
94,1887570942739783700,When you’re a blue collar worker but you use @...,CarpeDiemElon
95,1887570941997101468,RT @endlessubmision: 𓆩♡𓆪 🩷🤍🩵 𓆩♡𓆪\nChudai\n#lov...,CJMauzano
96,1887570935596691658,The Cherry XTRFY M64 Wireless fits my hand lik...,MiyatechLabs
97,1887570934757834792,US Search people finder review https://t.co/CC...,MiyatechLabs
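
Before dropping anything, it can help to count how many rows each rule would remove. Deduplicating on Tweet Text also discards identical retweets posted by different users, which is presumably why that line is marked optional. A minimal sketch on a made-up DataFrame (`sample` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sample with one repeated Tweet ID and two repeated texts.
sample = pd.DataFrame({
    "Tweet ID": [1, 1, 2, 3],
    "Tweet Text": ["hello", "hello", "world", "world"],
    "Username": ["a", "a", "b", "c"],
})

# duplicated() marks every repeat after the first occurrence.
dup_ids = int(sample.duplicated(subset=["Tweet ID"]).sum())
dup_text = int(sample.duplicated(subset=["Tweet Text"]).sum())

# Same two-stage drop as in the tutorial cell.
deduped = sample.drop_duplicates(subset=["Tweet ID"], keep="first")
deduped = deduped.drop_duplicates(subset=["Tweet Text"], keep="first")

print(dup_ids, dup_text, len(deduped))
```

Here the text rule removes a row from a different user (`c`) that the ID rule alone would have kept, so the choice depends on whether retweets count as separate observations.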


**Step 2: Handle Missing Values**
Since there are no missing values (as per df.info()), this step can be skipped.
However, in larger datasets, you should check using:

In [4]:
df.isnull().sum()


Unnamed: 0,0
Tweet ID,0
Tweet Text,0
Username,0
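
This dataset has no gaps, so nothing needs handling here. For a dataset that does, one possible approach is sketched below on a made-up DataFrame. Dropping rows without text and labelling missing usernames as `"unknown"` are assumptions chosen for illustration, not steps taken in this tutorial.

```python
import pandas as pd

# Hypothetical sample containing gaps; the real dataset has none.
sample = pd.DataFrame({
    "Tweet ID": [1, 2, 3],
    "Tweet Text": ["hello", None, "world"],
    "Username": ["a", "b", None],
})

# A tweet without text is of no use for text analysis, so drop those rows.
cleaned = sample.dropna(subset=["Tweet Text"]).copy()

# A missing username can be kept with a placeholder label (assumed choice).
cleaned["Username"] = cleaned["Username"].fillna("unknown")

print(cleaned)
```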


**Step 3: Normalize Text Format**
Convert Tweet Text to lowercase for consistency.
Strip leading and trailing whitespace. (Runs of extra spaces and newlines inside the text are collapsed later by clean_text in Step 4.)

In [5]:
df['Tweet Text'] = df['Tweet Text'].str.lower().str.strip()


In [8]:
df['Tweet Text'][1]

'showcase your #ai tool on popular #aitools .ai: top visibility with 100k+ enthusiasts across every page.\n\nwhat’s included\n✅reach your target audience\n✅boost your brands credibility\n✅easy submission &amp; support\n✅10 full days of exposure\nhttps://t.co/kdgslzicop'
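
The output above still contains `\n` sequences because `.str.strip()` only trims the two ends of each string. If internal whitespace should be collapsed at this step as well, a sketch along these lines would do it. The sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical tweets with newlines, tabs, and repeated spaces.
sample = pd.Series(["  Hello   World\n\nAgain ", "One\tTwo"])

normalized = (
    sample.str.lower()
          .str.replace(r"\s+", " ", regex=True)  # collapse whitespace runs
          .str.strip()                            # trim both ends
)

print(normalized.tolist())
```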

**Step 4: Remove Special Characters, Links, and Hashtags (Text Cleaning)**
Remove URLs (https://..., www.)
Remove mentions (@username)
Remove hashtags (#topic)
Remove non-alphanumeric characters except spaces

In [9]:
import re

def clean_text(text):
    """Strip URLs, mentions, hashtags, and special characters from one tweet."""
    text = re.sub(r'http\S+|www\.\S+', '', text)  # Remove URLs (dot escaped to match a literal "www.")
    text = re.sub(r'@\w+', '', text)  # Remove mentions
    text = re.sub(r'#\w+', '', text)  # Remove hashtags
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse whitespace runs and trim the ends
    return text

df['Tweet Text'] = df['Tweet Text'].apply(clean_text)


In [11]:
df['Tweet Text'][1]

'showcase your tool on popular ai top visibility with 100k enthusiasts across every page whats included reach your target audience boost your brands credibility easy submission amp support 10 full days of exposure'
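
The cleaned text above contains a stray `amp`: the tweet held the HTML entity `&amp;`, and removing the `&` and `;` left its letters behind. One way to avoid this is to decode entities with the standard library's `html.unescape` before removing special characters. The function name `clean_text_unescaped` and the example string (including its URL) are hypothetical.

```python
import html
import re

def clean_text_unescaped(text):
    """Hypothetical variant of clean_text that decodes HTML entities first."""
    text = html.unescape(text)                     # '&amp;' -> '&'
    text = re.sub(r"http\S+|www\.\S+", "", text)   # remove URLs
    text = re.sub(r"[@#]\w+", "", text)            # remove mentions and hashtags
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)     # remove remaining symbols
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

example = "easy submission &amp; support https://t.co/abc #ai @someone"
result = clean_text_unescaped(example)
print(result)
```

Because the entity becomes a plain `&` before the special-character pass, it is removed cleanly instead of leaving `amp` in the text.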

In [12]:
df.info()  # info() prints its summary directly and returns None, so it needs no print()
print(df.head())


<class 'pandas.core.frame.DataFrame'>
Index: 91 entries, 0 to 98
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Tweet ID    91 non-null     int64 
 1   Tweet Text  91 non-null     object
 2   Username    91 non-null     object
dtypes: int64(1), object(2)
memory usage: 4.9+ KB
              Tweet ID                                         Tweet Text  \
0  1887571541086584981  rt owner alphabets policy rollback on not usin...   
1  1887571522765594762  showcase your tool on popular ai top visibilit...   
2  1887571520689438969  dj qian has a pretty straightforward way of th...   
3  1887571515207463411    contact me now for hacking or recovery problems   
4  1887571509050241523  rt sui giveaway giving away 10 sui to 1 lucky ...   

         Username  
0   aogami_noboru  
1          dilxac  
2  AIDeepfakenews  
3   Jamescyber113  
4         Degatyu  
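
The summary above shows 91 rows whose index labels still run from 0 to 98, so labels no longer match row positions. Cleaning can also leave some texts empty or make previously distinct texts identical. A possible tidy-up pass is sketched below on a made-up DataFrame. Whether to drop empty or newly duplicated rows is a judgment call, not something the tutorial prescribes.

```python
import pandas as pd

# Hypothetical cleaned frame: gaps in the index, an empty text, a new duplicate.
cleaned = pd.DataFrame(
    {
        "Tweet ID": [1, 2, 3, 4],
        "Tweet Text": ["rt sui giveaway", "", "rt sui giveaway", "contact me now"],
        "Username": ["a", "b", "c", "d"],
    },
    index=[0, 2, 5, 9],
)

# Drop rows whose text became empty after cleaning.
cleaned = cleaned[cleaned["Tweet Text"].str.len() > 0]

# Texts that differed only by links or symbols may now be identical.
cleaned = cleaned.drop_duplicates(subset=["Tweet Text"], keep="first")

# Renumber rows 0..n-1 so labels match positions again.
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
```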
