masila002/Data-analysis-web-application

📊 Data Analysis Platform

A comprehensive web-based data analysis application built with Streamlit that enables users to upload Excel/CSV files, clean data, perform exploratory data analysis, create visualizations, and export results.

Features

1. Data Upload 📤

  • Support for Excel (.xlsx, .xls) and CSV files
  • Instant data preview
  • Basic statistics display
  • File information (size, memory usage)
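
As a rough illustration of the upload step, the sketch below picks a pandas reader from the file extension and gathers the file information listed above. The function names `load_table` and `file_info` are hypothetical and are not taken from app.py.

```python
import io
from pathlib import Path

import pandas as pd


def load_table(name: str, raw: bytes) -> pd.DataFrame:
    """Read uploaded bytes into a DataFrame, choosing the reader by extension."""
    suffix = Path(name).suffix.lower()
    buffer = io.BytesIO(raw)
    if suffix == ".csv":
        return pd.read_csv(buffer)
    if suffix in {".xlsx", ".xls"}:
        # .xlsx is read through openpyxl; legacy .xls needs a separate engine.
        return pd.read_excel(buffer)
    raise ValueError(f"Unsupported file type: {suffix or 'none'}")


def file_info(df: pd.DataFrame, raw: bytes) -> dict:
    """Collect basic size figures for the preview panel."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "file_size_bytes": len(raw),
        "memory_usage_bytes": int(df.memory_usage(deep=True).sum()),
    }


# Tiny in-memory CSV standing in for an uploaded file.
raw = b"region,sales\nNorth,120\nSouth,95\n"
df = load_table("sales.csv", raw)
info = file_info(df, raw)
```

In a Streamlit page, the name and bytes would typically come from the object returned by `st.file_uploader`.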

2. Data Cleaning 🧹

  • Remove Duplicates: Eliminate duplicate rows
  • Remove Empty Columns: Delete completely empty columns
  • Trim Whitespace: Clean text data
  • Standardize Data Types: Automatic type conversion
  • Handle Missing Values: Multiple strategies
    • Drop rows
    • Fill with mean
    • Fill with median
    • Forward fill
  • Remove Outliers: IQR or Z-score methods
  • Data quality reporting

3. Exploratory Data Analysis 📈

  • Summary Statistics (mean, median, std, skewness, kurtosis)
  • Data Type Information
  • Missing Data Report
  • Correlation Analysis
  • Categorical Variables Summary
  • Anomaly Detection
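
The summary statistics and missing data report can be approximated with plain pandas. This is a minimal sketch under the assumption that only numeric columns are summarized; the function names are illustrative and may differ from those in data_analyzer.py.

```python
import pandas as pd


def summary_statistics(df: pd.DataFrame) -> pd.DataFrame:
    """One row per numeric column with the statistics named in the feature list."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
        "skewness": numeric.skew(),
        "kurtosis": numeric.kurt(),
    })


def missing_data_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column."""
    missing = df.isna().sum()
    percent = (missing / len(df) * 100).round(2)
    return pd.DataFrame({"missing": missing, "percent": percent})


df = pd.DataFrame({
    "sales": [10.0, 12.0, 11.0, None, 50.0],
    "region": ["N", "S", "N", "S", None],
})
stats = summary_statistics(df)
missing = missing_data_report(df)
```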

4. Visualizations 📊

  • Histograms
  • Box Plots
  • Scatter Plots
  • Correlation Heatmaps
  • Bar Charts
  • Multiple Histograms
  • Line Charts
  • Violin Plots
  • All visualizations are interactive using Plotly

5. Export Results 💾

  • Download cleaned data as CSV or Excel
  • Generate and download comprehensive analysis reports
  • Export results in JSON format

Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)

Setup Steps

  1. Clone/Navigate to Project (the path below is the author's local path; substitute your own)
cd "c:\Users\Francis\Desktop\Projects\Data analysis project"
  2. Create Virtual Environment (Optional but recommended)
python -m venv venv
  3. Activate Virtual Environment

On Windows:

venv\Scripts\activate

On macOS/Linux:

source venv/bin/activate
  4. Install Dependencies
pip install -r requirements.txt

Running the Application

Start the Streamlit App

streamlit run app.py

The application will open in your default browser at http://localhost:8501

Running on a Different Port

streamlit run app.py --server.port 8502

Project Structure

Data analysis project/
├── app.py                          # Main Streamlit application
├── requirements.txt                # Python dependencies
├── modules/
│   ├── data_cleaner.py            # Data cleaning functions
│   ├── data_analyzer.py           # EDA and analysis functions
│   └── visualizer.py              # Visualization functions
└── README.md                       # This file

Module Documentation

data_cleaner.py

DataCleaner Class: Handles all data preprocessing operations

Methods:

  • remove_duplicates(): Remove duplicate rows
  • handle_missing_values(strategy): Handle missing values
  • remove_outliers(method, threshold): Remove outliers
  • standardize_data_types(): Standardize data types
  • remove_empty_columns(): Remove empty columns
  • trim_whitespace(): Clean text data
  • get_cleaning_report(): Get summary of operations
  • get_data_quality_report(): Get data quality metrics

data_analyzer.py

DataAnalyzer Class: Performs exploratory data analysis

Methods:

  • get_summary_statistics(): Summary statistics for numeric columns
  • get_data_types_info(): Information about data types
  • get_correlation_matrix(): Correlation matrix for numeric data
  • get_value_counts(column): Value counts for a column
  • get_categorical_summary(): Summary of categorical variables
  • detect_anomalies(method, threshold): Detect outliers/anomalies
  • get_missing_data_report(): Report on missing data
  • get_eda_report(): Comprehensive EDA report

visualizer.py

DataVisualizer Class: Create interactive visualizations

Methods:

  • plot_histogram(column, nbins): Create histogram
  • plot_box_plot(column): Create box plot
  • plot_scatter(x_col, y_col, color_col): Create scatter plot
  • plot_correlation_heatmap(): Correlation heatmap
  • plot_bar_chart(column, top_n): Bar chart for categories
  • plot_multiple_histograms(columns, nbins): Multiple histograms
  • plot_line_chart(x_col, y_cols): Line chart (time series)
  • plot_violin_plot(y_col, x_col): Violin plot

Usage Workflow

Step 1: Upload Data

  1. Navigate to "📤 Upload Data" page
  2. Click "Choose file" and select your Excel or CSV file
  3. Preview your data and review basic statistics

Step 2: Clean Data

  1. Navigate to "🧹 Data Cleaning" page
  2. Apply cleaning operations:
    • Start with removing duplicates and empty columns
    • Handle missing values (choose appropriate strategy)
    • Remove outliers if needed
    • Standardize data types
  3. Monitor the cleaning report to track changes

Step 3: Analyze Data

  1. Navigate to "📈 Exploratory Analysis" page
  2. Choose analysis type:
    • View summary statistics
    • Check data types and missing values
    • Analyze correlations
    • Review categorical summaries
    • Detect anomalies

Step 4: Visualize

  1. Navigate to "📊 Visualizations" page
  2. Create interactive charts:
    • Explore distributions with histograms
    • Check relationships with scatter plots
    • View correlations with heatmaps
    • Compare categories with bar charts

Step 5: Export

  1. Navigate to "💾 Export Results" page
  2. Download options:
    • Cleaned data (CSV or Excel)
    • Complete analysis report (JSON)
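
The download options above come down to serializing a DataFrame and a report dictionary into bytes or text. The helpers below are a sketch with hypothetical names, and the report layout is an assumption rather than the format the app actually produces.

```python
import io
import json

import pandas as pd


def to_csv_bytes(df: pd.DataFrame) -> bytes:
    """Serialize the cleaned data as UTF-8 CSV without the index column."""
    return df.to_csv(index=False).encode("utf-8")


def to_excel_bytes(df: pd.DataFrame) -> bytes:
    """Serialize the cleaned data as an .xlsx workbook (requires openpyxl)."""
    buffer = io.BytesIO()
    df.to_excel(buffer, index=False, engine="openpyxl")
    return buffer.getvalue()


def build_report(df: pd.DataFrame) -> dict:
    """Assemble a small report using only JSON-friendly Python types."""
    return {
        "rows": int(len(df)),
        "columns": [str(c) for c in df.columns],
        "missing": {str(c): int(n) for c, n in df.isna().sum().items()},
    }


df = pd.DataFrame({"region": ["North", "South"], "sales": [120, 95]})
csv_bytes = to_csv_bytes(df)
report_json = json.dumps(build_report(df), indent=2)
```

In Streamlit, bytes like these are what `st.download_button` accepts through its `data` argument.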

Data Cleaning Best Practices

Order of Operations

  1. Remove Duplicates - First step
  2. Remove Empty Columns - Before analysis
  3. Trim Whitespace - For text data
  4. Handle Missing Values - Choose strategy based on data
  5. Remove Outliers - After understanding the data
  6. Standardize Data Types - Last step
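
The order above can be sketched as a single pandas function. This minimal version covers steps 1 to 3 and step 6 only (missing values and outliers depend on a chosen strategy), and the function name `basic_clean` is hypothetical.

```python
import pandas as pd


def is_text(series: pd.Series) -> bool:
    """True for object or string columns."""
    return (pd.api.types.is_object_dtype(series)
            or pd.api.types.is_string_dtype(series))


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.drop_duplicates().copy()              # 1. remove duplicates
    out = out.dropna(axis="columns", how="all")    # 2. remove empty columns
    for col in out.columns:
        if is_text(out[col]):
            # 3. trim whitespace, leaving non-string values untouched
            out[col] = out[col].map(
                lambda v: v.strip() if isinstance(v, str) else v)
    for col in out.columns:
        if is_text(out[col]):
            try:
                # 6. convert text columns that hold only numbers
                out[col] = pd.to_numeric(out[col])
            except (ValueError, TypeError):
                pass  # genuinely textual column, keep as is
    return out.reset_index(drop=True)


df = pd.DataFrame({
    "name": [" Ann", "Bob ", " Ann", "Cy"],
    "amount": ["10", "20", "10", "30"],
    "notes": [None, None, None, None],
})
cleaned = basic_clean(df)
```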

Missing Values Strategies

  • Drop: Best when only a small share of rows (<5%) has missing values
  • Mean: For normally distributed numeric data
  • Median: For skewed numeric data (robust to outliers)
  • Forward Fill: For time series data
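
The four strategies map onto standard pandas calls. In the sketch below the strategy strings ("drop", "mean", "median", "ffill") are assumptions chosen for illustration and may not match the values accepted by handle_missing_values.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame, strategy: str) -> pd.DataFrame:
    """Apply one missing-value strategy and return a new DataFrame."""
    if strategy == "drop":
        return df.dropna()
    if strategy == "ffill":
        return df.ffill()
    if strategy in {"mean", "median"}:
        numeric = df.select_dtypes(include="number")
        # Per-column fill values; non-numeric columns are left unchanged.
        values = numeric.mean() if strategy == "mean" else numeric.median()
        return df.fillna(values)
    raise ValueError(f"Unknown strategy: {strategy}")


df = pd.DataFrame({"x": [1.0, None, 3.0, 100.0]})
dropped = fill_missing(df, "drop")
by_mean = fill_missing(df, "mean")
by_median = fill_missing(df, "median")
forward = fill_missing(df, "ffill")
```

With the single extreme value 100 in this sample, the mean fill is pulled well above the typical values while the median fill is not, which is the reason for preferring the median on skewed data.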

Outlier Detection

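  • IQR Method: Makes no normality assumption and tolerates skewed data (a multiplier of 1.5 is the common default)
  • Z-Score: Suited to roughly normally distributed data (a threshold of 3 standard deviations is the common default)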
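
Both rules reduce to a boolean mask over a numeric column. The following is a minimal sketch with hypothetical function names, not the code in data_cleaner.py or data_analyzer.py.

```python
import pandas as pd


def iqr_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def zscore_mask(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    std = s.std(ddof=0)
    if std == 0:
        # Constant column: nothing can be an outlier.
        return pd.Series(False, index=s.index)
    return ((s - s.mean()) / std).abs() > threshold


s = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])
flag_iqr = iqr_mask(s)
flag_z = zscore_mask(s)
flag_z_loose = zscore_mask(s, threshold=2.5)
```

On this eight-value sample the IQR rule flags 95, while the Z-score rule at threshold 3 flags nothing, because the extreme value inflates the standard deviation it is measured against. On small samples the Z-score threshold may need lowering, or the IQR rule may be the safer choice.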

Example Use Cases

  1. Sales Data Analysis

    • Upload sales data
    • Clean date formats and remove duplicates
    • Analyze trends and correlations
    • Visualize sales by region/product
  2. Survey Data Processing

    • Upload survey responses
    • Clean and standardize responses
    • Analyze demographics
    • Generate visualizations for reports
  3. Financial Data

    • Upload transaction data
    • Remove duplicates and handle missing values
    • Analyze spending patterns
    • Export for further analysis

Troubleshooting

Issue: "No module named 'streamlit'"

Solution: Run pip install -r requirements.txt

Issue: "File not found" when uploading

Solution: Ensure file path is correct and file is in Excel or CSV format

Issue: Visualization not displaying

Solution: Check that you have numeric columns for the selected visualization type

Issue: Memory error with large files

Solution:

  • Consider filtering data first
  • Use the data cleaning features to reduce size
  • Upload in chunks if possible

Performance Tips

  1. Large Files (>500MB)

    • Use CSV instead of Excel (faster)
    • Clean and filter data early
    • Consider chunking the analysis
  2. Many Columns (>100)

    • Use multiselect to focus on relevant columns
    • Generate correlation heatmaps selectively
  3. Slow Visualizations

    • Reduce number of data points
    • Simplify visualization types
    • Clear browser cache
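
The chunking suggestion under Large Files can be sketched with the `chunksize` option of `pandas.read_csv`, which yields the file piece by piece instead of loading it whole. The helper name `chunked_column_mean` is hypothetical, and the app itself may not work this way.

```python
import io

import pandas as pd


def chunked_column_mean(source, column: str, chunksize: int = 100_000) -> float:
    """Compute a column mean without holding the whole file in memory."""
    total = 0.0
    count = 0
    # usecols limits parsing to the one column that is needed.
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += float(values.sum())
        count += len(values)
    return total / count if count else float("nan")


# Small in-memory CSV standing in for a large file on disk.
csv_text = "id,amount\n" + "\n".join(f"{i},{i * 2}" for i in range(1, 11))
mean_amount = chunked_column_mean(io.StringIO(csv_text), "amount", chunksize=3)
```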

Dependencies

Package        Version   Purpose
streamlit      1.28.1    Web framework
pandas         2.1.3     Data manipulation
numpy          1.24.3    Numerical computing
openpyxl       3.11.0    Excel file support
scikit-learn   1.3.2     Machine learning utilities
plotly         5.18.0    Interactive visualizations
scipy          1.11.4    Statistical functions
seaborn        0.13.0    Statistical visualization
matplotlib     3.8.2     Plotting library

Future Enhancements

  • Advanced statistical tests (t-test, ANOVA, etc.)
  • Machine learning model training
  • Predictive analysis
  • Custom filtering and queries
  • Data merging capabilities
  • API integration
  • Real-time data updates
  • Advanced report generation (PDF, DOCX)
  • Data validation rules
  • Automated data quality scoring

License

This project is provided as-is for educational and commercial use.

Support

For issues or questions, please refer to:

Author

Data Analysis Platform v1.0
Built with ❤️ using Streamlit

Version: 1.0
Last Updated: November 2025
