A comprehensive web scraping and data analysis toolkit that combines financial data collection from Yahoo Finance with social media sentiment analysis from Reddit.
- Automated Stock Data Collection: Scrapes top stock gainers from Yahoo Finance
- Portfolio Analysis: Comprehensive analysis of stock performance and trends
- Data Visualization: Professional charts and graphs for portfolio insights
- Historical Data: Fetches and analyzes historical price data using yfinance
- Robust Error Handling: Handles dynamic web content and potential scraping challenges
- Social Media Data Mining: Collects posts and comments from specified subreddits
- Sentiment Analysis Ready: Structured data collection for sentiment analysis workflows
- Flexible Data Export: Exports data in CSV format for further analysis
- Rate Limiting Compliance: Implements proper API usage patterns
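The gainers scrape described above can be sketched as follows. This is a hypothetical outline, not the notebook's exact code: the CSS selectors, the column positions, and the `https://finance.yahoo.com/gainers` page layout are assumptions and may need adjusting.

```python
# Hypothetical sketch of the Selenium-based gainers scrape; selectors,
# URL, and column indices are assumptions, not the notebook's exact code.
# Requires: pip install selenium webdriver-manager

def parse_pct_change(text: str) -> float:
    """Normalize a Yahoo-style change string like '+3.45%' to a float."""
    return float(text.strip().lstrip("+").rstrip("%"))

def scrape_top_gainers(limit: int = 50) -> list:
    """Open Yahoo's gainers page headlessly and read the first `limit` rows."""
    # Lazy imports so the pure helper above also works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager  # auto ChromeDriver management

    opts = webdriver.ChromeOptions()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                              options=opts)
    try:
        driver.get("https://finance.yahoo.com/gainers")
        rows = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")[:limit]
        return [{"symbol": r.find_elements(By.TAG_NAME, "td")[0].text,
                 "name": r.find_elements(By.TAG_NAME, "td")[1].text}
                for r in rows]
    finally:
        driver.quit()
```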
```
web_scrapping/
├── Code/
│   ├── web_scraping_yahoo.ipynb   # Yahoo Finance scraper and analysis
│   └── redit_api.ipynb            # Reddit API data collection
├── Output/
│   ├── close_prices.csv           # Historical stock price data
│   ├── top_gainers_50.csv         # Top 50 stock gainers
│   ├── reddit_posts_data.csv      # Reddit posts data
│   ├── reddit_comments_data.csv   # Reddit comments data
│   └── portfolio_visual.png       # Portfolio visualization
├── requirements.txt               # Python dependencies
├── credentials_template.py        # Template for API credentials
├── .gitignore                     # Git ignore rules (protects credentials)
├── LICENSE                        # MIT License
└── README.md                      # This file
```
- Python 3.8 or higher
- Chrome browser (for Selenium web scraping)
- Reddit API account (for Reddit data collection)
```
git clone https://github.com/gsaco/web_scrapping.git
cd web_scrapping
pip install -r requirements.txt
```
- Visit Reddit Apps
- Create a new application (choose the "script" type)
- Copy `credentials_template.py` to `credentials.py`
- Fill in your Reddit API credentials in `credentials.py`:
```
client_id = "your_reddit_client_id_here"
client_secret = "your_reddit_client_secret_here"
user_agent = "your_app_name_here/1.0 by your_reddit_username"
```
```
jupyter notebook
```
- Open `Code/web_scraping_yahoo.ipynb`
- Run all cells to:
- Scrape top stock gainers from Yahoo Finance
- Collect historical price data
- Generate portfolio analysis and visualizations
- Export results to CSV files
Key Features:
- Dynamic content handling with Selenium WebDriver
- Automatic ChromeDriver management
- Comprehensive error handling and retries
- Portfolio performance visualization
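A minimal sketch of the historical-data and visualization step, assuming yfinance and matplotlib as listed in the acknowledgments. The ticker list and the base-100 rebasing are illustrative choices, not necessarily what the notebook does.

```python
# Hedged sketch of the yfinance history fetch and portfolio plot.
# Tickers and the base-100 rebasing are illustrative assumptions.
import pandas as pd

def to_base_100(close: pd.DataFrame) -> pd.DataFrame:
    """Rebase each ticker's prices so the first row equals 100 (comparable lines)."""
    return close / close.iloc[0] * 100.0

def fetch_close_prices(tickers, period="6mo") -> pd.DataFrame:
    """Download adjusted closing prices for the given tickers."""
    import yfinance as yf  # pip install yfinance
    data = yf.download(tickers, period=period, auto_adjust=True)
    return data["Close"]

if __name__ == "__main__":
    close = fetch_close_prices(["AAPL", "MSFT"])        # network call
    ax = to_base_100(close).plot(title="Portfolio performance (base = 100)")
    ax.figure.savefig("Output/portfolio_visual.png")
```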
- Ensure your `credentials.py` file is properly configured
- Open `Code/redit_api.ipynb`
- Run all cells to:
- Connect to Reddit API
- Collect posts from specified subreddits
- Gather comments data
- Export structured data for analysis
Key Features:
- OAuth2 authentication with Reddit API
- Configurable subreddit targeting
- Structured data export (CSV format)
- Rate limiting compliance
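The collection flow above can be sketched with PRAW roughly as below. The subreddit name, the `hot` listing, and the selected fields are assumptions for illustration; the notebook's configuration may differ.

```python
# Hedged sketch of the PRAW post collection; subreddit, listing, and
# field choices are illustrative assumptions.
import pandas as pd

def post_to_row(post) -> dict:
    """Flatten a PRAW Submission-like object into a CSV-ready row."""
    return {"id": post.id,
            "title": post.title,
            "score": post.score,
            "num_comments": post.num_comments,
            "created_utc": post.created_utc}

def collect_posts(subreddit: str = "stocks", limit: int = 100) -> pd.DataFrame:
    """Authenticate via credentials.py and collect posts from one subreddit."""
    import praw  # pip install praw; PRAW handles Reddit's rate limits internally
    from credentials import client_id, client_secret, user_agent
    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret,
                         user_agent=user_agent)
    rows = [post_to_row(p) for p in reddit.subreddit(subreddit).hot(limit=limit)]
    return pd.DataFrame(rows)
```

The pure `post_to_row` helper keeps the flattening logic separate from the API client, which makes the export format easy to adjust and test offline.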
| File | Description |
|---|---|
| `top_gainers_50.csv` | Top 50 stock gainers with symbols and company names |
| `close_prices.csv` | Historical closing prices for the analyzed stocks |
| `reddit_posts_data.csv` | Reddit posts with metadata (score, comments, timestamps) |
| `reddit_comments_data.csv` | Reddit comments in a sentiment-analysis-ready format |
| `portfolio_visual.png` | Portfolio performance visualization chart |
This project implements comprehensive credential protection:
- All sensitive data is excluded via `.gitignore`
- A template file is provided for easy setup
- Support for environment variables
- No hardcoded credentials in source code

Protected Files:
- `credentials.py` - Reddit API credentials
- `*.env` files - Environment variables
- `config.json` - Configuration files
- API keys and tokens
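One common pattern for the environment-variable support mentioned above is to prefer an environment variable and fall back to `credentials.py`. This is a hypothetical sketch; the variable names are assumptions, not the project's exact convention.

```python
# Hypothetical env-var-first credential lookup; the variable naming
# convention (upper-case env var, lower-case attribute) is an assumption.
import os

def load_credential(name, default=None):
    """Prefer an environment variable; fall back to credentials.py, then default."""
    value = os.environ.get(name)
    if value:
        return value
    try:
        import credentials  # the local, git-ignored credentials.py
        return getattr(credentials, name.lower())
    except (ImportError, AttributeError):
        if default is not None:
            return default
        raise KeyError(f"Missing credential: {name}")
```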
The project implements robust error handling for:
- Network connectivity issues
- API rate limiting
- Dynamic web content changes
- Missing data scenarios
- Browser compatibility issues
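A retry helper in the spirit of the error handling described above might look like this. It is an illustrative sketch, not the notebooks' exact code; the defaults are assumptions.

```python
# Illustrative retry-with-exponential-backoff helper for transient
# network and rate-limit failures; defaults are assumptions.
import time

def with_retries(fn, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Call fn(), retrying on failure with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

For example, a flaky page load could be wrapped as `with_retries(lambda: driver.get(url), exceptions=(WebDriverException,))`.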
- Stock performance tracking
- Historical price analysis
- Portfolio diversification metrics
- Trend identification
- Risk assessment indicators
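As a minimal sketch of performance and risk metrics computed from the exported close prices (assuming `close_prices.csv` has one column per ticker, one row per trading day):

```python
# Sketch of simple per-ticker metrics from the exported close prices;
# the one-column-per-ticker layout of close_prices.csv is an assumption.
import pandas as pd

def performance_summary(close: pd.DataFrame) -> pd.DataFrame:
    """Compute total return and annualized volatility for each ticker."""
    returns = close.pct_change().dropna()          # simple daily returns
    return pd.DataFrame({
        "total_return": close.iloc[-1] / close.iloc[0] - 1.0,
        "volatility": returns.std() * (252 ** 0.5),  # annualized, 252 trading days
    })
```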
- Sentiment-analysis-ready data structure
- Temporal analysis capabilities
- User engagement metrics
- Content popularity tracking
- Subreddit comparative analysis
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Commit your changes: `git commit -am 'Add feature'`
- Push to the branch: `git push origin feature-name`
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Yahoo Finance for providing accessible financial data
- Reddit API (PRAW) for social media data access
- Selenium for robust web scraping capabilities
- yfinance for financial data integration
- pandas & matplotlib for data analysis and visualization
Gabriel Saco
- GitHub: @gsaco
- Project Link: https://github.com/gsaco/web_scrapping