## Current Progress on Analyzing Components of Successful Photographs

1. **Data Collection:**
   - **Reddit API:** Successfully implemented the Reddit API to collect top posts from the "photographs" subreddit.
   - **Metadata Collection:** For each post, we have collected essential metadata such as title, URL, upvotes, comments, and more.
   - **Initial Dataset:** A dataset of top posts has been created, with images ready for further analysis.

2. **Image Processing and Object Detection:**
   - **Object Detection Setup:** Pre-trained models like YOLO and Faster R-CNN have been identified as suitable for detecting objects within the photographs.
   - **COCO Dataset Use:** Using the COCO dataset for object detection, focusing on identifying components such as "person," "car," "tree," and other elements commonly found in nature and portrait photographs.
   - **Current Focus:** Current efforts are centered on fine-tuning the object detection process to accurately identify relevant components within the images.

3. **Data Analysis:**
   - **Component Identification:** Preliminary analysis on identifying key components in images (e.g., people, cars, trees) has begun.
   - **Engagement Metrics:** The relationship between identified objects and engagement metrics (like upvotes) is being explored to determine what makes a photograph successful.

4. **Challenges and Next Steps:**
   - **Fine-tuning Object Detection:** Fine-tuning the object detection model to better recognize nuanced components within the photographs (e.g., distinguishing between different types of landscapes or scenes).
   - **Handling Diverse Image Content:** Devising strategies to handle the diverse content found in the "photographs" subreddit, including nature scenes, urban landscapes, and portraits.

5. **Tools and Libraries in Use:**
   - **Python Libraries:**
      - PRAW for Reddit API interaction.
      - OpenCV and Pillow for image processing.
      - TensorFlow or PyTorch for implementing and fine-tuning deep learning models.
      - Scikit-learn for clustering and other machine learning tasks.
      - Matplotlib and Seaborn for data visualization.
   
   - **Pre-trained Models:**
      - YOLO, Faster R-CNN for object detection.
      - Potential use of models like VGG16 or ResNet for advanced image classification and embeddings.

## Next Steps:

1. **Refine Object Detection:**
   - Continue refining the object detection models to improve accuracy in identifying relevant components within photographs.

2. **Expanded Data Analysis:**
   - Deepen the analysis of how various components (e.g., presence of people, landscapes) correlate with the success of a photograph on Reddit.

3. **Visualization and Reporting:**
   - Begin visualizing findings to identify trends and present the relationship between photo components and engagement metrics.

In [121]:
import praw
import requests
from PIL import Image
from io import BytesIO
import pandas as pd
from datetime import datetime, date
from dotenv import load_dotenv
import os
import hashlib
import time

In [122]:
load_dotenv()

reddit = praw.Reddit(
    client_id=os.getenv('REDDIT_CLIENT_ID'),
    client_secret=os.getenv('REDDIT_CLIENT_SECRET'),
    user_agent=os.getenv('REDDIT_USER_AGENT'), 
)

In [123]:
subreddit = reddit.subreddit('photographs')
lim = 1000

In [124]:
posts_df = pd.DataFrame(columns=["Date Posted","Title", "Content","Comments", "URL", "Score"])

for post in subreddit.top(limit=lim):
    
    # Transform timestamp to datetime object
    post_date = datetime.utcfromtimestamp(post.created_utc)
    
    # Ensure all comments are accesible and put them in a list.
    post.comments.replace_more(limit=None)
    comments_list = [comment.body for comment in post.comments if comment.is_root]
    
    posts_df = posts_df.append({
        "Date Posted": post_date,
        "Title": post.title,
        "Content": post.selftext,
        "URL": post.url,
        "Comments": comments_list,
        "Score": post.score
    }, ignore_index=True)

In [125]:
def generate_unique_id(url):
    return hashlib.md5(url.encode()).hexdigest()

In [126]:
posts_df['Unique ID'] = posts_df['URL'].apply(generate_unique_id)

In [127]:
posts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 998 entries, 0 to 997
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date Posted  998 non-null    datetime64[ns]
 1   Title        998 non-null    object        
 2   Content      998 non-null    object        
 3   Comments     998 non-null    object        
 4   URL          998 non-null    object        
 5   Score        998 non-null    object        
 6   Unique ID    998 non-null    object        
dtypes: datetime64[ns](1), object(6)
memory usage: 54.7+ KB


In [128]:
posts_df.head()

Unnamed: 0,Date Posted,Title,Content,Comments,URL,Score,Unique ID
0,2021-01-07 13:24:05,A special moment with my grandfather,,[My grandfather had not gotten his picture tak...,https://i.redd.it/g0eolj8mvw961.jpg,2433,976519df32486a271f7a6c4d926541f8
1,2022-09-11 16:13:41,CLOSER,,[I have a fascination with close up portraits ...,https://i.redd.it/p10m8ck179n91.jpg,2108,8a923d60029a980d3696765ba8b5bcc8
2,2020-10-21 13:09:30,A barn on a foggy morning in NH,,[I took this photo 2 weeks ago in New Hampshir...,https://i.redd.it/9wlc83666gu51.jpg,1999,106210dac14dd2c4dcc1e80e61080652
3,2022-11-25 15:45:39,Birds in the fog.,,[I was stuck at the office and saw the fog was...,https://i.imgur.com/BkAh5Cl.jpeg,1820,2f944b4257ea04799dca4a244e507860
4,2020-11-09 11:34:54,Glowing forest mushrooms,,[I had a go at something I've seen done a few ...,https://i.redd.it/nsehux7u97y51.jpg,1746,493f732e026330491fa6ca0b2ed56b6c


### Dowloading the images

In [129]:
# Directory to save images.
save_dir = f"{subreddit.display_name}_top_{lim}"
os.makedirs(save_dir, exist_ok=True)

# Creating quality variable to standardize the images.
image_quality = 85

# User-Agent header that mimics a request from a web browser so that the request appears more like a typical web browser request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}


In [130]:
start_time = time.time()
unsuccessful = 0

for index, row in posts_df.iterrows():
    image_url = row['URL']
    unique_id = row['Unique ID']
    
    try:
        response = requests.get(image_url, headers=headers)
        
        # Check if the request was successful
        if response.status_code == 200:
            content_type = response.headers.get('Content-Type')
            
            # Ensure the content is an image
            if 'image' in content_type:
                image = Image.open(BytesIO(response.content))
                
                # Handle images in mode P (palette) by converting to RGB
                if image.mode == 'P':
                    image = image.convert('RGB')
                    
                # Convert RGBA to RGB if necessary
                elif image.mode == 'RGBA':
                    image = image.convert('RGB')
            
            image_filename = f"{unique_id}.jpeg"
            image.save(os.path.join(save_dir, image_filename), quality=image_quality, optimize=True)
            image.close()
        else:
            unsuccessful += 1
            print(f"Failed to download image from {image_url}")
    except Exception as e:
        unsuccessful += 1
        print(f"Error processing {image_url}: {e}")

end_time = time.time()
print(f"Images downloaded successfully in {end_time - start_time:.2f} seconds.")

Error processing https://i.imgur.com/xRVJlHD.jpg: Image size (188622756 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
Error processing https://www.flickr.com/photos/mloganphotography/50151985597/in/dateposted-public/: Operation on closed image
Images downloaded successfully in 1297.02 seconds.


In [131]:
# Saving df as a pickle file
posts_df.to_pickle("posts_df.pkl")