A comprehensive web-based data analysis application built with Streamlit that enables users to upload Excel/CSV files, clean data, perform exploratory data analysis, create visualizations, and export results.
- Support for Excel (.xlsx, .xls) and CSV files
- Instant data preview
- Basic statistics display
- File information (size, memory usage)
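As a rough illustration of the upload step, the sketch below picks a pandas reader from the file extension and reports shape and memory usage. The function name `load_table` is hypothetical and is not taken from app.py. Reading .xlsx files relies on openpyxl, and .xls files typically need a separate engine to be installed.

```python
import io
from pathlib import Path

import pandas as pd


def load_table(filename: str, data: bytes) -> pd.DataFrame:
    """Read raw file bytes into a DataFrame, choosing the reader by extension."""
    suffix = Path(filename).suffix.lower()
    buffer = io.BytesIO(data)
    if suffix == ".csv":
        return pd.read_csv(buffer)
    if suffix in {".xlsx", ".xls"}:
        # pandas selects an Excel engine automatically (openpyxl for .xlsx)
        return pd.read_excel(buffer)
    raise ValueError(f"Unsupported file type: {suffix or filename}")


# Small in-memory CSV standing in for an uploaded file
sample = b"region,sales\nNorth,100\nSouth,250\n"
df = load_table("sales.csv", sample)
print(df.shape)                            # (rows, columns)
print(df.memory_usage(deep=True).sum())    # approximate memory usage in bytes
```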
- Remove Duplicates: Eliminate duplicate rows
- Remove Empty Columns: Delete completely empty columns
- Trim Whitespace: Clean text data
- Standardize Data Types: Automatic type conversion
- Handle Missing Values: Multiple strategies
  - Drop rows
  - Fill with mean
  - Fill with median
  - Forward fill
- Remove Outliers: IQR or Z-score methods
- Data quality reporting
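The basic cleaning operations listed above map onto a few pandas calls. This is a minimal sketch under the assumption that the app works on a pandas DataFrame; it is not the code in modules/data_cleaner.py, and the sample data is made up.

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "name": [" Ann ", "Bob", "Bob", "Cy"],
        "score": [1.0, 2.0, 2.0, 3.0],
        "unused": [None, None, None, None],
    }
)

cleaned = raw.drop_duplicates()                 # remove duplicate rows
cleaned = cleaned.dropna(axis=1, how="all")     # remove completely empty columns

# Trim leading and trailing whitespace in text columns only
for col in cleaned.columns:
    if pd.api.types.is_string_dtype(cleaned[col]):
        cleaned[col] = cleaned[col].str.strip()

print(cleaned)
```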
- Summary Statistics (mean, median, std, skewness, kurtosis)
- Data Type Information
- Missing Data Report
- Correlation Analysis
- Categorical Variables Summary
- Anomaly Detection
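Most of the analysis features above can be approximated with built-in pandas methods. The sketch below uses made-up data and illustrative variable names; the actual DataAnalyzer implementation may compute these differently.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "units": [1, 2, 3, 4, 5],
        "revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
        "region": ["N", "S", "N", "E", "N"],
    }
)

numeric = df.select_dtypes(include="number")

# Summary statistics, including skewness and kurtosis
summary = numeric.agg(["mean", "median", "std", "skew", "kurt"])

correlations = numeric.corr()                  # Pearson correlation by default
missing = df.isna().sum()                      # missing values per column
region_counts = df["region"].value_counts()    # categorical summary

print(summary)
print(correlations)
```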
- Histograms
- Box Plots
- Scatter Plots
- Correlation Heatmaps
- Bar Charts
- Multiple Histograms
- Line Charts
- Violin Plots
- All visualizations are interactive using Plotly
- Download cleaned data as CSV or Excel
- Generate and download comprehensive analysis reports
- Export results in JSON format
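A minimal sketch of how the CSV and JSON exports might be produced in memory. The report keys are illustrative and are not the app's actual schema. Excel export would use `DataFrame.to_excel`, which needs openpyxl and is not shown here.

```python
import json

import pandas as pd

cleaned = pd.DataFrame({"region": ["N", "S"], "sales": [100, 250]})

# CSV as bytes, suitable for a download button or for writing to disk
csv_bytes = cleaned.to_csv(index=False).encode("utf-8")

# A small JSON report (hypothetical structure)
report = {
    "rows": int(len(cleaned)),
    "columns": list(cleaned.columns),
    "missing_values": {col: int(n) for col, n in cleaned.isna().sum().items()},
    "summary": json.loads(cleaned.describe().to_json()),
}
report_json = json.dumps(report, indent=2)
print(report_json)
```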
- Python 3.8 or higher
- pip (Python package manager)
- Clone/Navigate to Project
    cd "c:\Users\Francis\Desktop\Projects\Data analysis project"
- Create Virtual Environment (Optional but recommended)
    python -m venv venv
- Activate Virtual Environment
  On Windows:
    venv\Scripts\activate
  On macOS/Linux:
    source venv/bin/activate
- Install Dependencies
    pip install -r requirements.txt

Run the application:

    streamlit run app.py

The application will open in your default browser at http://localhost:8501

To use a different port:

    streamlit run app.py --server.port 8502

Project structure:

    Data analysis project/
    ├── app.py                  # Main Streamlit application
    ├── requirements.txt        # Python dependencies
    ├── modules/
    │   ├── data_cleaner.py     # Data cleaning functions
    │   ├── data_analyzer.py    # EDA and analysis functions
    │   └── visualizer.py       # Visualization functions
    └── README.md               # This file

DataCleaner Class: Handles all data preprocessing operations
Methods:
- remove_duplicates(): Remove duplicate rows
- handle_missing_values(strategy): Handle missing values
- remove_outliers(method, threshold): Remove outliers
- standardize_data_types(): Standardize data types
- remove_empty_columns(): Remove empty columns
- trim_whitespace(): Clean text data
- get_cleaning_report(): Get summary of operations
- get_data_quality_report(): Get data quality metrics
DataAnalyzer Class: Performs exploratory data analysis
Methods:
- get_summary_statistics(): Summary statistics for numeric columns
- get_data_types_info(): Information about data types
- get_correlation_matrix(): Correlation matrix for numeric data
- get_value_counts(column): Value counts for a column
- get_categorical_summary(): Summary of categorical variables
- detect_anomalies(method, threshold): Detect outliers/anomalies
- get_missing_data_report(): Report on missing data
- get_eda_report(): Comprehensive EDA report
DataVisualizer Class: Creates interactive visualizations
Methods:
- plot_histogram(column, nbins): Create histogram
- plot_box_plot(column): Create box plot
- plot_scatter(x_col, y_col, color_col): Create scatter plot
- plot_correlation_heatmap(): Correlation heatmap
- plot_bar_chart(column, top_n): Bar chart for categories
- plot_multiple_histograms(columns, nbins): Multiple histograms
- plot_line_chart(x_col, y_cols): Line chart (time series)
- plot_violin_plot(y_col, x_col): Violin plot
- Navigate to "📤 Upload Data" page
- Click "Choose file" and select your Excel or CSV file
- Preview your data and review basic statistics
- Navigate to "🧹 Data Cleaning" page
- Apply cleaning operations:
  - Start with removing duplicates and empty columns
  - Handle missing values (choose appropriate strategy)
  - Remove outliers if needed
  - Standardize data types
- Monitor the cleaning report to track changes
- Navigate to "Exploratory Analysis" page
- Choose analysis type:
  - View summary statistics
  - Check data types and missing values
  - Analyze correlations
  - Review categorical summaries
  - Detect anomalies
- Navigate to "Visualizations" page
- Create interactive charts:
  - Explore distributions with histograms
  - Check relationships with scatter plots
  - View correlations with heatmaps
  - Compare categories with bar charts
- Navigate to "💾 Export Results" page
- Download options:
  - Cleaned data (CSV or Excel)
  - Complete analysis report (JSON)
- Remove Duplicates - First step
- Remove Empty Columns - Before analysis
- Trim Whitespace - For text data
- Handle Missing Values - Choose strategy based on data
- Remove Outliers - After understanding the data
- Standardize Data Types - Last step
- Drop: Best when only a small share of the data is missing (<5%)
- Mean: For normally distributed numeric data
- Median: For skewed numeric data (robust to outliers)
- Forward Fill: For time series data
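The four strategies above can be sketched with pandas as follows. The function name `handle_missing` and the strategy labels are hypothetical; the app's `handle_missing_values(strategy)` may use different names and behavior.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, strategy: str) -> pd.DataFrame:
    """Return a new DataFrame with missing values handled by the given strategy."""
    if strategy == "drop":
        return df.dropna()                                # drop rows with any gap
    if strategy == "mean":
        return df.fillna(df.mean(numeric_only=True))      # per-column mean
    if strategy == "median":
        return df.fillna(df.median(numeric_only=True))    # per-column median
    if strategy == "ffill":
        return df.ffill()                                 # carry last value forward
    raise ValueError(f"Unknown strategy: {strategy}")


df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 100.0]})
missing_share = df["x"].isna().mean()    # fraction of missing values in the column

print(missing_share)
print(handle_missing(df, "mean")["x"].tolist())
print(handle_missing(df, "median")["x"].tolist())
print(handle_missing(df, "ffill")["x"].tolist())
```

In this sample the value 100.0 pulls the mean well above the median, which is why the median fill is the more robust choice for skewed data.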
- IQR Method: Good for symmetric distributions (a multiplier of 1.5 is standard)
- Z-Score: Good for normally distributed data (a threshold of 3 is standard)
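Both outlier rules can be written as boolean masks over a numeric column. This is a sketch with hypothetical helper names (`iqr_mask`, `zscore_mask`) and made-up data, not the app's `remove_outliers(method, threshold)` implementation.

```python
import pandas as pd


def iqr_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)


def zscore_mask(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """True for values whose absolute z-score is at most the threshold."""
    z = (values - values.mean()) / values.std()
    return z.abs() <= threshold


# Twenty ordinary values plus one extreme value
data = pd.Series(list(range(10, 30)) + [1000])

kept_iqr = data[iqr_mask(data)]
kept_z = data[zscore_mask(data)]
print(len(data), len(kept_iqr), len(kept_z))
```

The Z-score depends on the mean and standard deviation, which the outliers themselves inflate, so on very small samples an extreme value may not reach a threshold of 3.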
- Sales Data Analysis
  - Upload sales data
  - Clean date formats and remove duplicates
  - Analyze trends and correlations
  - Visualize sales by region/product
- Survey Data Processing
  - Upload survey responses
  - Clean and standardize responses
  - Analyze demographics
  - Generate visualizations for reports
- Financial Data
  - Upload transaction data
  - Remove duplicates and handle missing values
  - Analyze spending patterns
  - Export for further analysis
Solution: Run pip install -r requirements.txt
Solution: Ensure file path is correct and file is in Excel or CSV format
Solution: Check that you have numeric columns for the selected visualization type
Solution:
- Consider filtering data first
- Use the data cleaning features to reduce size
- Upload in chunks if possible
- Large Files (>500MB)
  - Use CSV instead of Excel (faster)
  - Clean and filter data early
  - Consider chunking the analysis
- Many Columns (>100)
  - Use multiselect to focus on relevant columns
  - Generate correlation heatmaps selectively
- Slow Visualizations
  - Reduce number of data points
  - Simplify visualization types
  - Clear browser cache
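One way to chunk the analysis of a large CSV, as suggested in the tips above, is the `chunksize` option of `pandas.read_csv`. The sketch below aggregates each chunk and then combines the partial results. The in-memory data and column names are made up, and a file path can be passed in place of the buffer.

```python
import io

import pandas as pd

# Stand-in for a large CSV on disk
csv_source = io.StringIO("region,sales\n" + "North,10\nSouth,20\n" * 500)

partials = []
for chunk in pd.read_csv(csv_source, chunksize=200):
    # Each chunk is an ordinary DataFrame holding at most 200 rows
    partials.append(chunk.groupby("region")["sales"].sum())

# Combine the per-chunk sums into overall totals per region
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size and not on the file size.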
| Package | Version | Purpose |
|---|---|---|
| streamlit | 1.28.1 | Web framework |
| pandas | 2.1.3 | Data manipulation |
| numpy | 1.24.3 | Numerical computing |
| openpyxl | 3.11.0 | Excel file support |
| scikit-learn | 1.3.2 | Machine learning utilities |
| plotly | 5.18.0 | Interactive visualizations |
| scipy | 1.11.4 | Statistical functions |
| seaborn | 0.13.0 | Statistical visualization |
| matplotlib | 3.8.2 | Plotting library |
- Advanced statistical tests (t-test, ANOVA, etc.)
- Machine learning model training
- Predictive analysis
- Custom filtering and queries
- Data merging capabilities
- API integration
- Real-time data updates
- Advanced report generation (PDF, DOCX)
- Data validation rules
- Automated data quality scoring
This project is provided as-is for educational and commercial use.
For issues or questions, please refer to:
- Streamlit Documentation: https://docs.streamlit.io
- Pandas Documentation: https://pandas.pydata.org/docs
- Plotly Documentation: https://plotly.com/python
Data Analysis Platform v1.0. Built with ❤️ using Streamlit.
Version: 1.0. Last Updated: November 2025.