# Free Site to Site Migration Script for Wordpress

This script will help you to migrate posts, pages including a large volume of images from one Wordpress site to another without the use of any (paid) Wordpress migration plug-ins.

**Prerequisites:**

1. You have exported the posts, pages, comments, ... from your current Wordpress site via the standard `Wordpress Tools -> Export` functionality
2. You have installed the free plugin https://wordpress.org/plugins/media-sync/ (or similar) on your target Wordpress site

**Key steps of the script:**

- You need to manually define the input file name(s) and local output folder(s) as variables at the beginning of the script (`chapter 0 - Set Variables`)
- The script automatically identifies all potential image URLs (`chapter 2 - Extract Image URLs`)
- The script downloads all identified images into the earlier specified folder (`chapter 3 - Download all Images`)
- The script modifies the original image URLs within the Wordpress export xml file and replaces it with the newly generated image file name and location (`chapter 4 - Replace Original Image URLs with new Image URLs`)

**Output after executing the script:**

- All identified images are stored locally in the folder that you specified
- A copy of the original Wordpress export xml file including the modified / new image urls is stored locally

**Todo after script has executed:**

- You must now manually upload all your downloaded images into the target folder of your target Wordpress installation that you specified at the beginning of this notebook (e.g. via FTP client)
- Import the newly generated Wordpress xml file via the standard `Wordpress Tools -> Import` functionality
- Don't forget to refresh the media library in Wordpress (e.g. using this plugin: https://wordpress.org/plugins/media-sync/)


In [146]:
import requests
import re
import shutil
import validators
import pandas as pd
import hashlib
import os

## 0 - Set Variables

First, we need to set some basic parameters to better organize the downloaded images and the modified Wordpress xml export files. In case you have a large amount of posts and images, it is highly recommended that you structure the downloads by year, or even month-year.

In [147]:
# Define your target upload folder for Wordpress
WORDPRESS_UPLOAD_FOLDER = '/wp-content/uploads/history/year-2023/'
# Define target download folder (locally) to store all detected images
IMAGE_DOWNLOAD_FOLDER = 'extracted-images/year-2023/'
# The name and path of the original wordpress export xml
wordpress_filename_original = './your-personal-export-file-from-current-wordpress-site.xml'
# The name and path of the target (modified) wordpress export xml
wordpress_filename_target = './year-2023.xml'

## 1 - Functions

The following functions are required:

- **get_url_images_in_text(text):** Returns a simple list of URLs available in the provided text
- **get_file_name(url):** Returns only the file name of the file mentioned in the given URL (strips content prior the last occurrence of character "\\")
- **construct_file_name(url):** Generates a new file name which is based on a stripped down version of the original URL. The original file name is retained and a unique hash value is added (in case the same file name exists in different folders / URLs)
- **download_file(url,file_name):** Downloads the file available under url into file_name

In [148]:
def get_url_images_in_text(text):
    url_candidates = []
    # First round of regex based cleaning:
    pattern = r'[\=,\(][\"|\'].[^\=\"]+\.(?i:jpg|gif|png|bmp|jpeg)[\"|\']'
    results = re.findall(pattern,text,re.IGNORECASE)
    for x in results:
      x = x.replace('="', '')
      x = x.replace('"', '')
      x = x.replace('\'', '')
      url_candidates.append(x)
    # Another regex check to filter out the URLs (since some slipped through the pattern above)
    urls = []
    for y in url_candidates:
      results = re.findall(r'(https?://\S+)', y)
      for x in results:
        urls.append(x)
    return urls

def get_file_name(url):
  firstpos=url.rfind("/")
  lastpos=len(url)
  return url[firstpos+1:lastpos]

def construct_file_name(url):
  hashvalue = hashlib.sha1(url.encode('utf-8')).hexdigest()
  file_name = get_file_name(url)
  return hashvalue+file_name

def download_file(url,file_name):
  if validators.url(url):
    response = requests.get(url, stream = True)
    if response.status_code == 200:
      # Create directory if it does not yet exist
      os.makedirs(os.path.dirname(file_name), exist_ok=True)
      # Write content from response to file
      with open(file_name,'wb') as f:
        shutil.copyfileobj(response.raw, f)
      print('Image sucessfully downloaded: ',file_name)
    else:
      print('Image couldn\'t be retrieved')
  else:
    print(f'URL {url} is not valid')

## 2 - Extract Image URLs

Script opens the original Wordpress export xml (stored in variable wordpress_filename_original) and identifies all relevant lines that contain image URLs. For each line, the script extracts the image related URLs. Those URLs are stored within a Pandas dataframe. The dataframe contains the following information:
- **index:** index number starting from 0
- **line:** original line number where the image URL was discovered within the Wordpress xml file
- **original_image_url:** the URL of the original image
- **new_image_file_name:** the file name of the new image
- **new_image_url:** the URL pointing to the new image

In [149]:
image_urls = []
# Words to look for in each line
words = ['.jpg','jpeg', '.png','.JPG', '.JPEG', '.PNG']
with open(wordpress_filename_original, 'r') as fp:
    for l_no, line in enumerate(fp):
        # Search if line contains any images based on the suffixes above
        for word in words:
            if word in line:
                # Extract URLs using the above custom function
                image_urls_in_line = get_url_images_in_text(line)
                # Remove potential duplicates (same URL could be mentioned multiple times in single line)
                image_urls_in_line = list(dict.fromkeys(image_urls_in_line))
                # Add list of URLS within line to the image url list
                for image_url in image_urls_in_line:
                    row = []
                    row.append(str(l_no))
                    row.append(image_url)
                    image_urls.append(row)
df_image_urls = pd.DataFrame(image_urls)
df_image_urls.columns =['line','original_image_url']
df_image_urls['new_image_file_name'] = df_image_urls['original_image_url'].apply(lambda x: construct_file_name(x))
df_image_urls['new_image_url'] = df_image_urls['new_image_file_name'].apply(lambda x: WORDPRESS_UPLOAD_FOLDER + x)

## 3 - Download all Images

All images stored in dataframe `df_image_url` are downloaded into the folder specified in variable `IMAGE_DOWNLOAD_FOLDER`.

In [150]:
for image_url in image_urls:
    file_name = construct_file_name(image_url[1])
    print(file_name)
    download_file(image_url[1], IMAGE_DOWNLOAD_FOLDER+file_name)

1a18246f360cfa8ec97b83c2935e6cce5e535ea9stilouette-gift-click-here-4.jpg
Image sucessfully downloaded:  extracted-images/all-pages/1a18246f360cfa8ec97b83c2935e6cce5e535ea9stilouette-gift-click-here-4.jpg
6238f81780e3e46d403a01b7664867d1dfe1dd65stilouette-voucher.jpg
Image sucessfully downloaded:  extracted-images/all-pages/6238f81780e3e46d403a01b7664867d1dfe1dd65stilouette-voucher.jpg
2bc638543b7760e35765db8a225c01c88e0e580dIMG_0150-1024x683.jpg
Image sucessfully downloaded:  extracted-images/all-pages/2bc638543b7760e35765db8a225c01c88e0e580dIMG_0150-1024x683.jpg
2a49286219eba226b7c5bb34f7130d39ee04866dstilouette-kids-school-extract.jpg
Image sucessfully downloaded:  extracted-images/all-pages/2a49286219eba226b7c5bb34f7130d39ee04866dstilouette-kids-school-extract.jpg
e7f2524c0e98b886af438eddebecd53e3f907138stilouette-question.png
Image sucessfully downloaded:  extracted-images/all-pages/e7f2524c0e98b886af438eddebecd53e3f907138stilouette-question.png
e7f2524c0e98b886af438eddebecd53e3f90

## 4 - Replace Original Image URLs with new Image URLs

In the final step, a new Wordpress xml file is generated that has replaced all the previous image URLs with the newly generated URLs. The file is written to the location you specified in variable `wordpress_filename_target`.

In [151]:
target_file= open(wordpress_filename_target,'w+')
# Iterate line-by-line through the original wordpress export file
with open(wordpress_filename_original, 'r') as fp:
    for l_no, line in enumerate(fp):
        df_images_in_line = df_image_urls.query(f'line == "{l_no}"')
        # Check if any changes are required for the current line
        if df_images_in_line.size > 0:
            images_list = df_images_in_line.values.tolist()
            # Replace all original images (image[1]) with the new image urls (image[3])
            for image in images_list:
                line = line.replace(image[1],image[3])
                print(f'Line {l_no} updated')
        # Write either unchanged or changed line to the new file
        target_file.write(line)
target_file.close()

Line 1072 updated
Line 1076 updated
Line 1254 updated
Line 2165 updated
Line 2238 updated
Line 2241 updated
Line 2246 updated
Line 2250 updated
Line 2417 updated
Line 2493 updated
Line 2503 updated
Line 2505 updated
Line 2507 updated
Line 2509 updated
Line 2511 updated
Line 2513 updated
Line 2515 updated
Line 2517 updated
Line 2519 updated
Line 2521 updated
Line 2523 updated
Line 2525 updated
Line 2695 updated
Line 2763 updated
Line 2767 updated
Line 2770 updated
Line 2773 updated
Line 2777 updated
Line 2780 updated
Line 2783 updated
Line 3203 updated
Line 3206 updated
Line 3210 updated
Line 3214 updated
Line 3427 updated
Line 3428 updated
Line 3631 updated
Line 3796 updated
Line 3934 updated
Line 3936 updated
Line 3938 updated
Line 3940 updated
Line 3942 updated
Line 3944 updated
Line 3946 updated
Line 3948 updated
Line 3950 updated
Line 3952 updated
Line 3954 updated
Line 3956 updated
Line 4043 updated
Line 4325 updated
Line 5673 updated
Line 5824 updated
Line 5828 updated
Line 6072 