# Identifying Fake News

Goal:

The goal of this project is to build a classification model that can distinguish real news articles from fake ones.

Dataset:

For this project I will use the "Fake News Detection" dataset from Kaggle. It contains articles from various sources, each labeled as either real or fake.

Dataset properties:

* Number of rows: 20,800 in `train.csv` (plus 5,200 unlabeled rows in `test.csv`)
* Columns:
    * `id`: Unique identifier of the article
    * `title`: Title of the article
    * `author`: Author of the article
    * `text`: The article body
    * `label`: Class of the article (0 = real, 1 = fake)


## Load Data

In [2]:
import pandas as pd

train_data = pd.read_csv('dataset/train.csv')
test_data = pd.read_csv('dataset/test.csv')
submit_data = pd.read_csv('dataset/submit.csv')

print(train_data.head())
print(test_data.head())

   id                                              title              author  \
0   0  House Dem Aide: We Didn’t Even See Comey’s Let...       Darrell Lucus   
1   1  FLYNN: Hillary Clinton, Big Woman on Campus - ...     Daniel J. Flynn   
2   2                  Why the Truth Might Get You Fired  Consortiumnews.com   
3   3  15 Civilians Killed In Single US Airstrike Hav...     Jessica Purkiss   
4   4  Iranian woman jailed for fictional unpublished...      Howard Portnoy   

                                                text  label  
0  House Dem Aide: We Didn’t Even See Comey’s Let...      1  
1  Ever get the feeling your life circles the rou...      0  
2  Why the Truth Might Get You Fired October 29, ...      1  
3  Videos 15 Civilians Killed In Single US Airstr...      1  
4  Print \nAn Iranian woman has been sentenced to...      1  
      id                                              title  \
0  20800  Specter of Trump Loosens Tongues, if Not Purse...   
1  20801  Russian war

## Inspect Data

In [3]:
train_data.info()
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      5200 non-null   int64 
 1   title   5078 non-null   object
 2   author  4697 non-null   object
 3   text    5193 non-null   object
dtypes: int64(1), object(3)
memory usage: 162.6+ KB


Fill in missing values in `title`, `author`, and `text` with placeholder strings.

In [5]:
# training data
train_data['title'] = train_data['title'].fillna('No title')
train_data['author'] = train_data['author'].fillna('Unknown')
train_data['text'] = train_data['text'].fillna('No text')

# testing data
test_data['title'] = test_data['title'].fillna('No title')
test_data['author'] = test_data['author'].fillna('Unknown')
test_data['text'] = test_data['text'].fillna('No text')


## Text Preprocessing

Lowercase: Convert all text to lowercase for consistency.  
Remove special characters and digits: Reduce the text to words only.

In [6]:
import re

def clean_text(text):
    text = text.lower()  # convert to lowercase
    text = re.sub(r'\d+', '', text)  # remove digits
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation and special characters
    text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace last, after the removals
    return text


train_data['clean_text'] = train_data['text'].apply(clean_text)
test_data['clean_text'] = test_data['text'].apply(clean_text)
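
The order of the whitespace and punctuation steps affects the result, which a made-up sentence makes visible. The sketch below compares the two orders; the function names and the sample string are illustrative only and not part of the notebook.

```python
import re

def collapse_then_strip(text):
    """Collapse whitespace first, then remove punctuation."""
    text = re.sub(r'\d+', '', text.lower())
    text = re.sub(r'\s+', ' ', text).strip()
    return re.sub(r'[^\w\s]', '', text)

def strip_then_collapse(text):
    """Remove punctuation first, then collapse whitespace."""
    text = re.sub(r'\d+', '', text.lower())
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample sentence, not taken from the dataset.
sample = "Breaking: 15 civilians killed - officials say!"
print(repr(collapse_then_strip(sample)))
print(repr(strip_then_collapse(sample)))
```

The first order leaves a double space where the standalone dash was removed. Tokenizers that split on whitespace are unaffected, but the stored `clean_text` column is tidier when whitespace is collapsed last.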


Tokenization: Break the text down into individual words.  

In [12]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [11]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

train_data['tokens'] = train_data['clean_text'].apply(tokenize_text)
test_data['tokens'] = test_data['clean_text'].apply(tokenize_text)


[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>


LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/Users/carolina/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.12/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.12/share/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.12/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************


Remove stop words: These are common words (such as "the", "a", "in") that are unlikely to help the analysis.  

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens

train_data['filtered_tokens'] = train_data['tokens'].apply(remove_stopwords)
test_data['filtered_tokens'] = test_data['tokens'].apply(remove_stopwords)
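
As a small illustration of the filter, the sketch below uses a tiny hand-written stop-word set instead of the NLTK list, so it runs without any downloads. The set `demo_stop_words` and the sample tokens are made up for this example, and the set is deliberately incomplete.

```python
# Hypothetical, deliberately small stop-word set (not the NLTK list).
demo_stop_words = {"the", "a", "in", "to", "has", "been"}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [word for word in tokens if word not in stop_words]

tokens = ["an", "iranian", "woman", "has", "been", "sentenced", "to", "prison"]
print(remove_stopwords(tokens, demo_stop_words))
# ['an', 'iranian', 'woman', 'sentenced', 'prison']
```

Storing the stop words in a `set` keeps each membership test constant time on average, which helps when the filter runs over every token of 20,800 articles.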


Stem or lemmatize: Reduce words to their base or root form.  

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNetLemmatizer needs the WordNet corpus

lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized

train_data['lemmatized_tokens'] = train_data['filtered_tokens'].apply(lemmatize_tokens)
test_data['lemmatized_tokens'] = test_data['filtered_tokens'].apply(lemmatize_tokens)


Vectorization: Convert the processed text into numerical values.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Join the tokens back into strings
train_data['final_text'] = train_data['lemmatized_tokens'].apply(lambda x: ' '.join(x))
test_data['final_text'] = test_data['lemmatized_tokens'].apply(lambda x: ' '.join(x))

# Initialize and apply TF-IDF (fit on the training data only)
vectorizer = TfidfVectorizer(max_features=1000)
X_train = vectorizer.fit_transform(train_data['final_text'])
X_test = vectorizer.transform(test_data['final_text'])

# Target variable
y_train = train_data['label']
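
To make the numbers in `X_train` less opaque, here is a sketch of how a TF-IDF weight is computed, written in plain Python for a made-up three-document corpus. It follows what I understand to be scikit-learn's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document) but tokenizes with a simple `split()`, so treat it as an approximation rather than a reimplementation of `TfidfVectorizer`. The function name `tfidf` and the corpus are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (idf per term, one {term: weight} dict per document)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf, assumed to match scikit-learn's default formula.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize each document; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return idf, rows

# Hypothetical mini corpus, not taken from the dataset.
docs = ["trump email leak", "trump rally crowd", "email server leak leak"]
idf, rows = tfidf(docs)
print(rows[1])
```

In the second document, `rally` and `crowd` occur in only one document each and therefore get a higher weight than `trump`, which occurs in two. With `max_features=1000`, the vectorizer keeps only the 1,000 terms with the highest total count across the corpus.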


Save the processed data

In [None]:
train_data.to_csv('train_processed.csv', index=False)
test_data.to_csv('test_processed.csv', index=False)
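
One caveat when saving: CSV has no list type, so the token columns are written as their string representation and come back as plain strings on reload. The sketch below shows the effect with the standard-library `csv` module, which, as far as I know, stringifies non-string cells the same way pandas does, and recovers the list with `ast.literal_eval`. The sample row is made up.

```python
import ast
import csv
import io

# Write one hypothetical row whose second cell is a list of tokens.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "tokens"])
writer.writerow([0, ["house", "dem", "aide"]])  # the list is written via str()

# Read it back: the cell is now a string, not a list.
buffer.seek(0)
row = list(csv.DictReader(buffer))[0]
print(type(row["tokens"]).__name__, row["tokens"])

tokens = ast.literal_eval(row["tokens"])  # parse the string back into a list
print(tokens)
```

The TF-IDF matrices `X_train` and `X_test` are not part of the DataFrames, so these `to_csv` calls do not save them.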
