This project focuses on classifying deepfake media using a machine learning model. The objective is to analyze a dataset containing metadata about various media files and predict whether they are real or fake using a Random Forest Classifier.
The dataset deepfake_detection_metadata_dataset.csv contains 1000 rows of media metadata. It includes the following features:
- media_type: Image or Video
- content_category: News, Social Media, Interview, Political Speech
- face_count: Number of faces detected in the media
- audio_present: Whether audio is present
- lip_sync_score: Assessment of the lip sync quality
- visual_artifacts_score: Score indicating the presence of visual artifacts
- compression_level: Level of compression applied to the media
- lighting_inconsistency_score: Score evaluating lighting inconsistencies
- source_platform: Social media or news platform where the media was sourced
- label: Real or Fake
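As a quick sanity check on the schema, the columns above can be inspected with pandas. The snippet below builds a few synthetic rows standing in for deepfake_detection_metadata_dataset.csv (the values are invented for illustration), so it runs without the file present; in the notebook you would instead call pd.read_csv on the CSV.

```python
import pandas as pd

# Synthetic stand-in rows matching the dataset description; the real
# notebook loads deepfake_detection_metadata_dataset.csv instead.
df = pd.DataFrame({
    "media_type": ["Image", "Video", "Video"],
    "content_category": ["News", "Interview", "Political Speech"],
    "face_count": [1, 2, 1],
    "audio_present": [False, True, True],
    "lip_sync_score": [0.0, 0.72, 0.35],
    "visual_artifacts_score": [0.10, 0.55, 0.80],
    "compression_level": [3, 7, 5],
    "lighting_inconsistency_score": [0.05, 0.40, 0.65],
    "source_platform": ["NewsSite", "SocialApp", "SocialApp"],
    "label": ["Real", "Fake", "Fake"],
})

# Shape, column dtypes, and class balance of the target
print(df.shape)
print(df.dtypes)
print(df["label"].value_counts())
```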
The analysis runs in a Jupyter Notebook (deepfake_analysis.ipynb) and covers the following steps:
- Data Loading and Exploration: Loading the data using pandas.
- Data Cleaning: Dropping irrelevant columns.
- Encoding: Converting categorical data into numeric values (One-Hot Encoding) and mapping the target label (Real to 0, Fake to 1).
- Feature Scaling: Preprocessing numerical features using StandardScaler.
- Model Training: Splitting the data into training (80%) and testing (20%) sets, then training a RandomForestClassifier.
- Evaluation: The model is evaluated on the test set. An initial run using all features performs exceptionally well, while a more realistic evaluation that excludes the directly predictive artifact scores performs markedly worse, demonstrating the challenges in deepfake detection.
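The steps above can be sketched end-to-end as follows. A small randomly generated frame stands in for the real CSV so the sketch is self-contained; the column subset, random seeds, and forest settings are illustrative assumptions, not the notebook's exact configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for deepfake_detection_metadata_dataset.csv
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "media_type": rng.choice(["Image", "Video"], n),
    "face_count": rng.integers(1, 4, n),
    "lip_sync_score": rng.random(n),
    "visual_artifacts_score": rng.random(n),
    "label": rng.choice(["Real", "Fake"], n),
})

# Encoding: one-hot the categorical column, map the target (Real -> 0, Fake -> 1)
X = pd.get_dummies(df.drop(columns="label"), columns=["media_type"])
y = df["label"].map({"Real": 0, "Fake": 1})

# 80/20 train/test split, then scale features (fit the scaler on train only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train and evaluate the Random Forest
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_scaled, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test_scaled)))
```

Because the synthetic labels are random, the printed accuracy here hovers near chance; on the real metadata the notebook's all-features run scores much higher.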
- Python 3
- pandas
- scikit-learn
- Jupyter Notebook
- Activate the provided virtual environment (venv).
- Ensure required packages are installed (e.g., pip install pandas scikit-learn).
- Start a Jupyter server and open deepfake_analysis.ipynb.
- Run the notebook cells sequentially to reproduce the workflow.
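On a Unix-like system, the setup steps above correspond roughly to the commands below (the venv path and the extra notebook package are assumptions; adjust for your environment):

```shell
# Activate the project's virtual environment (path assumed to be ./venv)
source venv/bin/activate

# Install the required packages, plus Jupyter itself if not already present
pip install pandas scikit-learn notebook

# Launch Jupyter and open the analysis notebook
jupyter notebook deepfake_analysis.ipynb
```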