# **Big Data Analysis in Google Colab**
This notebook walks through **Big Data Analysis** of real-world datasets, step by step, using **Dask, Pandas, and Google BigQuery**.

## **1. COVID-19 Data Analysis**
**Dataset:** [Johns Hopkins COVID-19 Data](https://github.com/CSSEGISandData/COVID-19)

In [None]:
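# Install Dask with the DataFrame extras (the bare "dask" package omits them)
!pip install "dask[dataframe]"
import dask.dataframe as dd

# Load the global confirmed-cases time series (one row per province/country, one column per date)
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
df_covid = dd.read_csv(url)
df_covid.head()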

### **Analysis:** Identify the most affected countries

In [None]:
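import matplotlib.pyplot as plt
# Drop the text and coordinate columns so that only the daily cumulative counts are summed
df_counts = df_covid.drop(columns=["Province/State", "Lat", "Long"])
df_grouped = df_counts.groupby("Country/Region").sum().compute()
# The last column holds the most recent cumulative total for each country
df_grouped.iloc[:, -1].nlargest(10).plot(kind='bar', title="Top 10 COVID-19 Affected Countries")
plt.show()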
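
The aggregation above can be checked on a small frame before running it on the full file. The sketch below uses plain Pandas and a toy table that only mimics the wide layout of the Johns Hopkins file; the country names and counts are made up for illustration, and `latest_totals` is a hypothetical helper name.

```python
import pandas as pd

# Toy frame in the same wide layout: metadata columns first, then one column per date.
# All names and numbers here are invented for illustration.
toy = pd.DataFrame({
    "Province/State": ["A1", "A2", None, None],
    "Country/Region": ["Aland", "Aland", "Borduria", "Cascadia"],
    "Lat": [1.0, 2.0, 3.0, 4.0],
    "Long": [5.0, 6.0, 7.0, 8.0],
    "1/1/23": [10, 20, 50, 5],
    "1/2/23": [15, 30, 40, 6],
})

def latest_totals(df):
    """Sum the date columns per country and rank by the most recent date."""
    counts = df.drop(columns=["Province/State", "Lat", "Long"])
    grouped = counts.groupby("Country/Region").sum()
    # The last column is the latest date, so it carries the current cumulative total
    return grouped.iloc[:, -1].sort_values(ascending=False)

totals = latest_totals(toy)
print(totals)
```

Rows for the same country (here the two "Aland" provinces) are added together, which is why the province and coordinate columns have to be excluded from the sum.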

## **2. NYC Taxi Trip Data Analysis**
**Dataset:** [NYC Taxi Trip Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

In [None]:
# TLC now publishes trip records as Parquet files rather than CSV
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -O nyc_taxi.parquet
df_taxi = dd.read_parquet("nyc_taxi.parquet")
df_taxi.head()

### **Analysis:** Trip distances distribution

In [None]:
df_taxi["trip_distance"].compute().hist(bins=50)
plt.xlabel("Trip Distance (miles)")
plt.ylabel("Frequency")
plt.title("NYC Taxi Trip Distance Distribution")
plt.show()
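
Raw trip records may contain a few extreme or invalid distances, and a single huge value can squeeze almost every trip into the first histogram bin. One possible remedy, sketched here with plain Pandas on made-up numbers, is to drop negative values and trim the upper tail at a chosen quantile before plotting. The helper name `trim_upper_tail` and the cutoff are assumptions, not part of the original analysis.

```python
import pandas as pd

def trim_upper_tail(values, quantile=0.99):
    """Keep non-negative values at or below the given quantile."""
    values = values[values >= 0]
    cutoff = values.quantile(quantile)
    return values[values <= cutoff]

# Invented sample: nine ordinary trips and one implausible outlier
distances = pd.Series([0.5, 1.2, 2.0, 3.1, 0.9, 1.7, 2.4, 4.0, 1.1, 2500.0])
trimmed = trim_upper_tail(distances, quantile=0.9)
print(len(distances), len(trimmed), trimmed.max())
```

With the real data, the same function could be applied to `df_taxi["trip_distance"].compute()` before calling `.hist(bins=50)`.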

## **3. Amazon Reviews Sentiment Analysis**
**Dataset:** [Amazon Reviews](https://nijianmo.github.io/amazon/index.html)

In [None]:
!pip install nltk wordcloud
import nltk
from wordcloud import WordCloud

url = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz"
df_amazon = dd.read_json(url, lines=True)
df_amazon.head()

### **Analysis:** Word cloud of most common review words

In [None]:
# head() on a Dask Series already returns a pandas Series, so no compute() call is needed
text = " ".join(df_amazon["reviewText"].dropna().head(5000))
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud of Amazon Reviews")
plt.show()
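
A word cloud is a picture of word frequencies, so it can help to look at the counts directly. The following is a minimal sketch using only the standard library; the stop-word list is a deliberately tiny, hypothetical one, and `top_words` is an illustrative helper rather than anything provided by `wordcloud` or `nltk`.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list for illustration only
STOP_WORDS = {"the", "a", "and", "is", "it", "this", "to", "of", "i"}

def top_words(text, n=3):
    """Return the n most frequent words, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Invented sample text standing in for the joined reviews
sample = "This battery is great. The battery life is great and the price is fair."
print(top_words(sample))
```

On the real data, `top_words(text, n=20)` would list the words that appear largest in the cloud, assuming `text` is the joined review string built in the cell above.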

## **4. Twitter Data Analysis**
**Dataset:** Real-time tweets using Tweepy API

In [None]:
!pip install tweepy pandas
import tweepy
import pandas as pd

# Add Twitter API keys
consumer_key = "your_consumer_key"
consumer_secret = "your_consumer_secret"
access_token = "your_access_token"
access_token_secret = "your_access_token_secret"

# Authenticate with OAuth 1.0a user credentials (OAuthHandler is the older alias of this class)
auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Search for tweets
query = "Big Data"
tweets = tweepy.Cursor(api.search_tweets, q=query, lang="en").items(100)

# Store tweets in DataFrame
df_tweets = pd.DataFrame([[tweet.user.screen_name, tweet.text, tweet.favorite_count, tweet.retweet_count] for tweet in tweets],
                         columns=["User", "Text", "Likes", "Retweets"])

df_tweets.head()
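
Pasting API keys directly into a shared notebook makes them easy to leak. One common alternative, sketched below, is to read them from environment variables. The variable names and the `load_twitter_credentials` helper are hypothetical; any naming scheme works as long as the notebook and the environment agree.

```python
import os

# Hypothetical variable names; adjust them to match your own environment
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {name: "placeholder" for name in REQUIRED_VARS}
creds = load_twitter_credentials(demo_env)
print(sorted(creds))
```

Calling `load_twitter_credentials()` with no argument reads from `os.environ`, and the returned values can then be assigned to `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.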

## **5. Google BigQuery Analysis**
**Dataset:** NYC Taxi Trips (Google BigQuery Public Data)

In [None]:
from google.colab import auth
auth.authenticate_user()

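from google.cloud import bigquery
# Queries are billed to a Google Cloud project; replace the placeholder with your own project ID
client = bigquery.Client(project="your-project-id")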

query = """
SELECT COUNT(*) AS total_trips, AVG(trip_distance) AS avg_distance
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2015`
-- Half-open range so that trips on December 31 after midnight are included
WHERE pickup_datetime >= '2015-01-01' AND pickup_datetime < '2016-01-01'
"""

df_bigquery = client.query(query).to_dataframe()
df_bigquery
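
The date filter in the query is worth a sanity check. Assuming a bare date string such as `'2015-12-31'` is coerced to midnight at the start of that day when compared with a timestamp column, an inclusive upper bound on it leaves out almost all of December 31, while a half-open range ending at `'2016-01-01'` covers the whole year. The sketch below reproduces the two comparisons with standard-library datetimes; the function names and the sample trip time are illustrative only.

```python
from datetime import datetime

def in_inclusive_range(ts, start, end):
    """Mimics `ts BETWEEN start AND end` (both bounds inclusive)."""
    return start <= ts <= end

def in_half_open_range(ts, start, end):
    """Mimics `ts >= start AND ts < end` (end excluded)."""
    return start <= ts < end

year_start = datetime(2015, 1, 1)
dec_31_midnight = datetime(2015, 12, 31)   # what '2015-12-31' becomes as a timestamp
next_year_start = datetime(2016, 1, 1)

# A made-up evening pickup on the last day of the year
late_trip = datetime(2015, 12, 31, 18, 30)

missed = not in_inclusive_range(late_trip, year_start, dec_31_midnight)
caught = in_half_open_range(late_trip, year_start, next_year_start)
print(missed, caught)
```

The half-open form also avoids having to spell out the last representable instant of the year.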