# Data Collection

## Real-Time Data Collection
Set up a pipeline that continuously collects current stock prices and other relevant financial data.
Fetch the data through an API such as yfinance or Alpha Vantage, or directly from a market data provider. Free sources such as yfinance may serve delayed quotes for some exchanges, so treat the values as near real-time.
Store the data in Azure Blob Storage or a database so that it is easy to access for further analysis.
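
The "continuously collect" step can be sketched as a simple polling loop. This is a minimal illustration rather than the notebook's actual scheduler: `collect_periodically`, `fetch_fn`, `store_fn`, and `sleep_fn` are hypothetical names, and in production a scheduled job would likely be a better fit than a long-running loop.

```python
import time
from typing import Callable, Dict, Optional


def collect_periodically(
    fetch_fn: Callable[[], Dict[str, float]],
    store_fn: Callable[[Dict[str, float]], None],
    interval_seconds: float = 60.0,
    max_iterations: Optional[int] = None,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch_fn, pass non-empty results to store_fn, wait, and repeat.

    Returns the number of snapshots that were stored.
    (Hypothetical helper; not part of yfinance or the Azure SDK.)
    """
    stored = 0
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        try:
            snapshot = fetch_fn()
            if snapshot:  # skip empty results, e.g. when no prices came back
                store_fn(snapshot)
                stored += 1
        except Exception as exc:  # keep the loop alive on transient failures
            print(f"Iteration {iteration} failed: {exc}")
        sleep_fn(interval_seconds)
    return stored
```

In this notebook, `fetch_fn` could be the `fetch_data` function and `store_fn` a small wrapper that builds the DataFrame and calls `upload_blob`. Passing `max_iterations` keeps a test run bounded.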

In [0]:
# Install necessary libraries
!pip install yfinance azure-storage-blob sqlalchemy pyodbc

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.

## Fetch Real-Time Data

In [0]:
import yfinance as yf
import pandas as pd

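def get_real_time_stock_price(stock_symbol):
    """Return the latest available closing price for stock_symbol, or None if no data is found."""
    stock = yf.Ticker(stock_symbol)
    # Request the most recent trading day's price history
    hist = stock.history(period='1d')
    if hist.empty:  # No rows returned, e.g. unknown symbol or no recent trading data
        print(f"No data found for {stock_symbol}")
        return None
    # The last 'Close' value is the most recent price in the returned history
    return hist['Close'].iloc[-1]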

def fetch_data():
    # List of stock symbols to fetch
    stocks = ['BANKBARODA.NS', 'HDFCBANK.NS', 'SBIN.NS', 'ICICIBANK.NS', 'AXISBANK.NS', '^BSESN', '^NSEI']
    data = {}
    for stock in stocks:
        price = get_real_time_stock_price(stock)
        if price is not None:  # Only add to data if price is not None
            data[stock] = price
    return data

# Fetch and display the data
data = fetch_data()
df = pd.DataFrame(data, index=[0])
print(df)  # Use print if display is not available

   BANKBARODA.NS  HDFCBANK.NS  ...        ^BSESN         ^NSEI
0     273.299988  1700.150024  ...  79357.492188  24108.050781

[1 rows x 7 columns]
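
The snapshot above carries no indication of when it was fetched, which makes repeated collections hard to tell apart. One possible approach, shown here as a sketch, is to stamp each row with a UTC timestamp before building the DataFrame. The helper name `add_fetch_timestamp` and the column name `fetched_at_utc` are assumptions for illustration, not part of the original notebook.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def add_fetch_timestamp(
    prices: Dict[str, float],
    fetched_at: Optional[datetime] = None,
) -> Dict[str, object]:
    """Return a new row dict with a UTC ISO-8601 timestamp as its first key.

    (Hypothetical helper; the column name 'fetched_at_utc' is an assumption.)
    """
    if fetched_at is None:
        fetched_at = datetime.now(timezone.utc)
    # Normalize to UTC so rows collected on different machines are comparable
    stamp = fetched_at.astimezone(timezone.utc).isoformat(timespec="seconds")
    row: Dict[str, object] = {"fetched_at_utc": stamp}
    row.update(prices)  # the input dict is left unchanged
    return row
```

With this helper, the DataFrame could be built as `pd.DataFrame(add_fetch_timestamp(data), index=[0])`, which would place the timestamp in the first column.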


## Upload to Blob Storage

In [0]:
import os
from datetime import datetime

from azure.storage.blob import BlobServiceClient

# Azure Storage connection string.
# Do not hard-code the account key in the notebook. Load it from an environment
# variable (the variable name below is an assumption) or from a secret store,
# for example dbutils.secrets.get(scope=..., key=...) on Databricks.
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
container_name = 'riskpredict-data'

# Initialize the BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(connect_str)

def upload_blob(dataframe, file_name):
    # Convert DataFrame to CSV
    csv_data = dataframe.to_csv(index=False)

    # Create a BlobClient
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=file_name)

    # Upload the CSV data
    blob_client.upload_blob(csv_data, overwrite=True)

# Generate a file name based on current timestamp and upload the DataFrame
file_name = f"realtime_data_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
upload_blob(df, file_name)
print(f"Data uploaded to blob storage as {file_name}")


Data uploaded to blob storage as realtime_data_20240701_064619.csv
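
Because each run writes a new timestamped file, the container can fill up with a flat list of blobs. One possible refinement, sketched below, is to put the date into the blob name as a path. Slashes in blob names are commonly shown as virtual folders by Azure tooling. The function name `build_blob_name` and the `realtime` prefix are assumptions for illustration.

```python
from datetime import datetime


def build_blob_name(timestamp: datetime, prefix: str = "realtime") -> str:
    """Build a date-partitioned blob name such as prefix/YYYY/MM/DD/<file>.csv.

    (Hypothetical helper; the layout is an assumption, not an Azure requirement.)
    """
    date_path = timestamp.strftime("%Y/%m/%d")
    # Keep the same file naming pattern the notebook already uses
    file_part = timestamp.strftime("realtime_data_%Y%m%d_%H%M%S.csv")
    return f"{prefix}/{date_path}/{file_part}"


print(build_blob_name(datetime(2024, 7, 1, 6, 46, 19)))
# prints: realtime/2024/07/01/realtime_data_20240701_064619.csv
```

The result can be passed to `upload_blob(df, build_blob_name(datetime.now()))` in place of the flat file name.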
