Website Analyzer - AI/ML Fraud Detection System

Project Overview

Welcome to the Website Analyzer project! This initiative addresses the pressing need for an Automated AI/ML System to Detect and Mitigate Online Fraud. The core objective is to create and implement an AI/ML-based system that can autonomously analyze and categorize online content, distinguishing between authentic and fake/fraudulent websites, advertisements, and customer care numbers.

Project Name

Website Analyzer

Project Description

"Website Analyzer" is a sophisticated web application designed for analyzing the legitimacy of given websites. The system employs various techniques, including domain legitimacy checks, SSL certificate validation, and a powerful machine learning model known as Url Analyzer. This model predicts website legitimacy based on 17 key features. The platform goes beyond traditional analysis by incorporating two additional subsystems: Extracted URLs - Hyperlink Fetcher and NLP Content Analysis System.

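The 17 features used by the model are not listed in this README. As a rough illustration of what URL-based feature extraction can look like, here is a minimal sketch using only the Python standard library. The function name `extract_url_features` and the six features shown are assumptions chosen for illustration, not the project's actual feature set.

```python
import re
from urllib.parse import urlparse

# Matches a dotted-quad host such as 192.168.0.1 (illustrative check only).
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def extract_url_features(url: str) -> dict:
    """Return a few illustrative lexical features for a URL.

    These are hypothetical examples, not the 17 features used by the project.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        "has_at_symbol": int("@" in url),
        "host_is_ip": int(bool(IPV4_RE.match(host))),
        "dots_in_host": host.count("."),
        "has_hyphen_in_host": int("-" in host),
    }


features = extract_url_features("http://192.168.0.1/login@secure")
```

A dictionary like this could be turned into a fixed-order numeric vector before being passed to a trained classifier.
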
Subsystems

Extracted URLs - Hyperlink Fetcher:
- This subsystem extracts all hyperlinks present on the page at the input URL, giving users a view of the URLs that a given website links to.
NLP Content Analysis System:
- The NLP Content Analysis System delves into the content of websites, extracting text from images using OCR (Optical Character Recognition). This NLP-driven analysis categorizes content as legitimate or potentially spammy, adding an extra layer of scrutiny to the evaluation process.
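
The hyperlink extraction step can be sketched as follows. The project lists Beautiful Soup for HTML parsing; this sketch uses the standard library's `html.parser` instead so that it runs without extra dependencies. The names `LinkCollector` and `extract_links` are hypothetical and are not taken from the project's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "/about" to absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    """Return the unique hyperlinks in html, in order of first appearance."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return list(dict.fromkeys(collector.links))


sample = (
    '<a href="/about">About</a> '
    '<a href="https://example.org/x">X</a> '
    '<a href="/about">Again</a>'
)
links = extract_links(sample, "https://example.com/")
```

In a real run, the HTML would come from the fetched page (for example, the page source returned by Selenium) rather than from a hard-coded string.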

Features

Domain Legitimacy Check:
- Analyzes the expiration date of the domain to determine its legitimacy.
SSL Certificate Validation:
- Checks the validity of the SSL certificate associated with the analyzed website.
HTTPS Detection:
- Determines whether the website uses HTTP or HTTPS.
Machine Learning Prediction:
- Uses 17 key features to predict whether the analyzed website is legitimate, providing insights into potential security risks.
Image Content Analysis:
- Utilizes NLP and image recognition methods to assess the authenticity and accuracy of ad content and images on the website.
Customer Care Number Database Creation:
- Implements web scraping to build a database of fraudulent customer care numbers and uses a REST API to verify incoming numbers for potential scams. (Work in progress)
Feedback Form:
- Provides a feedback mechanism to enhance the system’s accuracy and adapt to evolving fraudulent tactics.
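
As a rough illustration of the HTTPS detection and SSL certificate validation features, here is a minimal sketch using only the Python standard library. The function names `uses_https`, `fetch_certificate`, and `days_until_expiry` are hypothetical; the project's own checks in `app.py` may work differently. The WHOIS-based domain expiry check is not shown because it needs a third-party library or a WHOIS server query.

```python
import socket
import ssl
from datetime import datetime, timezone
from urllib.parse import urlparse


def uses_https(url: str) -> bool:
    """Return True if the URL's scheme is HTTPS."""
    return urlparse(url).scheme.lower() == "https"


def fetch_certificate(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Connect to host and return its verified certificate as a dict.

    Raises ssl.SSLError (or a subclass) if the certificate fails
    verification, and OSError if the connection cannot be made.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def days_until_expiry(cert: dict, now: datetime) -> float:
    """Return the number of days between now and the certificate's notAfter date."""
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - now.timestamp()) / 86400
```

For a live check, `days_until_expiry(fetch_certificate("example.com"), datetime.now(timezone.utc))` returns the remaining validity in days; a negative value would mean the certificate has expired.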

Prerequisites

Ensure you have the following prerequisites installed:

Python
Flask
TensorFlow.js for machine learning predictions
Selenium for web scraping
Beautiful Soup for HTML parsing
ChromeDriver (for Chrome/Chromium) for automated browsing

Installation

Clone the repository:

git clone https://github.com/heygauri/30f7f52d2f8f09918a279510ed00e6b5_CyberZen_05

Install dependencies:
```
pip install -r requirements.txt
```
Run the application:
```
python app.py
```
Open the application in your web browser: http://localhost:5000/
Enter the URL of the website you want to analyze in the provided user interface.
Click the "Analyze" button.
Review the analysis results, including domain legitimacy, SSL information, HTTPS usage, and machine learning predictions.

Future Work

Develop a browser extension that integrates the URL analyzer to detect fraudulent URLs in real time for end users.
Scale the web analyzer by recursively analyzing the hyperlinks found on input URLs.
Extend OCR to multiple languages, followed by multilingual NLP models.
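
One possible shape for the recursive hyperlink analysis is a depth-limited, breadth-first traversal. This is only a sketch of the idea, not a planned implementation: the function `crawl_links`, its `get_links` callback, and the `max_depth` and `max_pages` limits are all assumptions introduced for illustration.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_links(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 100,
) -> list:
    """Visit URLs breadth-first from start_url, following links up to max_depth.

    get_links is any callable that returns the hyperlinks found on a URL,
    so the traversal can be tested without network access.
    """
    seen = {start_url}
    visited = []
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # Do not expand links beyond the depth limit.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for real pages.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["e"]}
visited = crawl_links("a", lambda u: graph.get(u, []), max_depth=2)
```

The depth and page limits keep the traversal bounded, which matters given the scanning cost noted under Limitations. Each visited URL could then be passed to the URL analyzer.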

Limitations

The NLP model uses TensorFlow and would ideally run on a GPU, but it is currently restricted to CPUs due to resource limitations.
The Url_Analyzer model's accuracy is currently 77%.
OCR processing is still being optimized for efficiency.
Scanning every hyperlink across an entire website is time-consuming given current resource limitations. Not every link is checked yet, but the extracted hyperlinks already make it possible to look for potential phishing URLs within legitimate websites.

Contributing

Contributions are welcome! Please follow our Contribution Guidelines before submitting pull requests.

License

This project is licensed under the MIT License. Feel free to use, modify, and distribute the code as needed.
